AWS Scan ignores withLimit() - amazon-dynamodb

I am trying to fetch the items from a DynamoDB table to put them in a CSV file. Here is the code:
ArrayList<String> ids = new ArrayList<String>();
ScanResult result = null;
do {
    ScanRequest req = new ScanRequest();
    req.setTableName("table");
    req.withLimit(10);
    if (result != null) {
        req.setExclusiveStartKey(result.getLastEvaluatedKey());
    }
    AmazonDynamoDBClient client = new AmazonDynamoDBClient(awsCreds);
    result = client.scan(req);
    List<Map<String, AttributeValue>> rows = result.getItems();
    for (Map<String, AttributeValue> map : rows) {
        try {
            AttributeValue v = map.get("prod_number");
            String id = v.getS();
            ids.add(id);
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage());
        }
    }
} while (result.getLastEvaluatedKey() != null);
System.out.println("Result size: " + ids.size());
I want to know why 'req.withLimit(10)' has no impact on the number of results. The scan still fetches all the records.

The Limit property of ScanRequest is documented as follows:
The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed dataset size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Working with Queries in the Amazon DynamoDB Developer Guide.
So it limits only how many items a single request evaluates, not the whole scan operation. Since your loop keeps issuing requests until LastEvaluatedKey is null, you still end up with every matching item.
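If the goal is to cap the overall number of items (say, 10 in total), the paging loop itself has to stop once enough items have been collected. A minimal sketch against the AWS SDK for Java v1, reusing the table name and awsCreds from the question; the desiredTotal cutoff is a hypothetical illustration:
int desiredTotal = 10; // hypothetical overall cap
List<String> ids = new ArrayList<>();
Map<String, AttributeValue> lastKey = null;
AmazonDynamoDBClient client = new AmazonDynamoDBClient(awsCreds);
do {
    ScanRequest req = new ScanRequest()
            .withTableName("table")
            .withLimit(10)                   // page size, not total result size
            .withExclusiveStartKey(lastKey);
    ScanResult page = client.scan(req);
    for (Map<String, AttributeValue> item : page.getItems()) {
        ids.add(item.get("prod_number").getS());
        if (ids.size() >= desiredTotal) {
            break;
        }
    }
    lastKey = page.getLastEvaluatedKey();
} while (lastKey != null && ids.size() < desiredTotal);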

Related

Number of records in grid AX 2012

I tried to count the number of rows in a grid at runtime with this code:
FormRun caller;
FormDataSource fds;
Query query;
QueryRun queryRun;
int64 rows;

fds = caller.dataSource();
query = fds.query();
queryRun = new QueryRun(query);
rows = SysQuery::countTotal(queryRun); // this returns -1587322268
rows = SysQuery::countLoops(queryRun); // this returns 54057
The last line of code is closest to what I need, because there are 54057 rows, but if I add filters it still returns 54057.
I want logic that returns the number of rows the grid has at the moment the method is called.
Your query has more than one datasource.
The best way to explain your observation is to look at the implementation of countTotal and countLoops.
public client server static Integer countTotal(QueryRun _queryRun)
{
    container c = SysQuery::countPrim(_queryRun.pack(false));
    return conpeek(c, 1);
}

public client server static Integer countLoops(QueryRun _queryRun)
{
    container c = SysQuery::countPrim(_queryRun.pack(false));
    return conpeek(c, 2);
}

private server static container countPrim(container _queryPack)
{
    ...
    if (countQuery.dataSourceCount() == 1)
        qbds.addSelectionField(fieldnum(Common, RecId), SelectionField::Count);

    countQueryRun = new QueryRun(countQuery);
    while (countQueryRun.next())
    {
        common = countQueryRun.get(countQuery.dataSourceNo(1).table());
        counter += common.RecId;
        loops++;
    }
    return [counter, loops];
}
If your query contains only one datasource, it adds count(RecId).
countTotal returns the number of records.
countLoops returns 1.
Pretty fast, as fast as the SQL allows.
If your query contains more than one datasource, it does not add count(RecId).
countTotal returns the sum of recIds (makes no sense).
countLoops returns the number of records.
Also countLoops is slow if there are many records as they are counted one by one.
If you have two datasources and want a fast count, you are on your own:
fds = caller.dataSource();
queryRun = new QueryRun(fds.queryRun().query());
queryRun.query().dataSourceNo(2).joinMode(JoinMode::ExistsJoin);
queryRun.query().dataSourceNo(1).clearFields();
queryRun.query().dataSourceNo(1).addSelectionField(fieldnum(Common,RecId),SelectionField::Count);
queryRun.next();
rows = queryRun.getNo(1).RecId;
The reason your count did not respect the filters is that you used datasource.query() rather than datasource.queryRun().query(). The former is the static query; the latter is the dynamic query with the user's filters included.
Update: I found some old code with a more general approach:
static int tableCount(QueryRun _qr)
{
    QueryRun qr;
    Query q = new Query(_qr.query());
    int dsN = _qr.query().dataSourceCount();
    int ds;

    for (ds = 2; ds <= dsN; ++ds)
    {
        if (q.dataSourceNo(ds).joinMode() == JoinMode::OuterJoin)
            q.dataSourceNo(ds).enabled(false);
        else if (q.dataSourceNo(ds).joinMode() == JoinMode::InnerJoin)
        {
            q.dataSourceNo(ds).joinMode(JoinMode::ExistsJoin);
            q.dataSourceNo(ds).fields().clearFieldList();
        }
    }
    q.dataSourceNo(1).fields().clearFieldList();
    q.dataSourceNo(1).addSelectionField(fieldNum(Common, RecId), SelectionField::Count);
    qr = new QueryRun(q);
    qr.next();
    return any2int(qr.getNo(1).RecId);
}

How to write 5000 records into DynamoDB Table?

I have a use case where I have to write 5000 records into a DynamoDB table in one shot. I am using the batchSave API of the DynamoDBMapper library; it can write up to 25 records in one go.
Can I pass the list of 5000 records to it, and will it internally split them into batches of 25 records and write them to the DynamoDB table, or do I have to handle this in my own code with some conditional logic and pass only 25 records at a time to batchSave?
According to its documentation, batchSave():
Saves the objects given using one or more calls to the AmazonDynamoDB.batchWriteItem
Indeed, it splits up the items you give it into appropriately-sized batches (25 items) and writes them using the DynamoDB BatchWriteItem operation.
You can see the code that does this in batchWrite() in DynamoDBMapper.java:
/** The max number of items allowed in a BatchWrite request */
static final int MAX_ITEMS_PER_BATCH = 25;

// Break into chunks of 25 items and make service requests to DynamoDB
for (final StringListMap<WriteRequest> batch :
        requestItems.subMaps(MAX_ITEMS_PER_BATCH, true)) {
    List<FailedBatch> failedBatches = writeOneBatch(batch, config.getBatchWriteRetryStrategy());
    ...
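In practice this means the whole list can be handed to batchSave in one call. A minimal usage sketch, assuming a hypothetical Product class annotated with @DynamoDBTable and a hypothetical loadFiveThousandProducts() source; note that batchSave returns the batches that failed so they can be retried:
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
DynamoDBMapper mapper = new DynamoDBMapper(client);

List<Product> products = loadFiveThousandProducts(); // hypothetical 5000-item list
// batchSave chunks this into 25-item BatchWriteItem calls internally.
List<DynamoDBMapper.FailedBatch> failed = mapper.batchSave(products);

// Failed/unprocessed writes are returned rather than thrown, so check them.
for (DynamoDBMapper.FailedBatch fb : failed) {
    System.err.println("Needs retry: " + fb.getUnprocessedItems());
}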
Here are the methods I use in order to achieve this end. I manage to do it by first chunking the data array into small arrays (of length 25):
const queryChunk = (arr, size) => {
    const tempArr = []
    for (let i = 0, len = arr.length; i < len; i += size) {
        tempArr.push(arr.slice(i, i + size));
    }
    return tempArr
}

const batchWriteManyItems = async (tableName, itemObjs, chunkSize = 25) => {
    return await Promise.all(queryChunk(itemObjs, chunkSize).map(async chunk => {
        await dynamoDB.batchWriteItem({RequestItems: {[tableName]: chunk}}).promise()
    }))
}

How do MaxItemCount from FeedOptions and RetrievedDocumentCount from QueryMetrics work in Cosmos DB, and why do they never match?

I am currently facing a query performance issue with Cosmos DB. I am quite sure I have followed most of the performance tips from the Microsoft page, but the query still takes > 1 second.
Connection policy
private static readonly ConnectionPolicy ConnectionPolicy = new ConnectionPolicy
{
    ConnectionMode = ConnectionMode.Direct,
    ConnectionProtocol = Protocol.Tcp,
    RequestTimeout = new TimeSpan(1, 0, 0),
    MaxConnectionLimit = 1000,
    RetryOptions = new RetryOptions
    {
        MaxRetryAttemptsOnThrottledRequests = 10,
        MaxRetryWaitTimeInSeconds = 60
    }
};
Document Client
this.Client = new DocumentClient(new Uri(config.DocumentDBURI), config.DocumentDBKey, ConnectionPolicy);
Document Query
FeedOptions options = new FeedOptions
{
    MaxItemCount = config.getSearchLimit, // which is 100
    PartitionKey = new PartitionKey(partitionKey),
    RequestContinuation = responseContinuation
};
var documentQuery = Client.CreateDocumentQuery<SearchByAttributesResult>(
    this.TenantCollectionUri,
    querySpec,
    options).AsDocumentQuery();
Query 1
SELECT p.Doc.id, p.Doc.Name, p.Doc.isOrganization, p.Doc.organizationLegalName,
       p.Doc.isFactoryAutoUpdate, p.Doc.StartDate, p.Doc.EndDate, p.Doc.InactiveReasonCode,
       p.Doc.Specialty.specialty AllSpecialty, Address
FROM p
JOIN Address IN p.Doc.Address.address
WHERE (p.Doc.EndDate = null OR (p.Doc.StartDate <= @STARTDATE AND p.Doc.EndDate >= @ENDDATE))
  AND CONTAINS(p.Doc.Name, @PROVIDERNAME)
  AND Address.alpha2Code = @ALPHA2CODE
Query 2
SELECT p.Doc.id, p.Doc.Name, p.Doc.isOrganization, p.Doc.organizationLegalName,
       p.Doc.isFactoryAutoUpdate, p.Doc.StartDate, p.Doc.EndDate, p.Doc.InactiveReasonCode,
       p.Doc.Specialty.specialty AllSpecialty, Address
FROM p
JOIN Address IN p.Doc.Address.address
WHERE (p.Doc.EndDate = null OR (p.Doc.StartDate <= @STARTDATE AND p.Doc.EndDate >= @ENDDATE))
  AND STARTSWITH(Address.postalCode, @POSTALCODE)
  AND Address.alpha2Code = @ALPHA2CODE
The query above changes based on the user's search criteria.
I have only 900 documents in my collection, but the query still always takes > 1 second.
I am trying to understand a few points here:
Though I set MaxItemCount to 100, why do I see RetrievedDocumentCount from QueryMetrics as 900?
Is the use of CONTAINS/STARTSWITH causing this performance issue?
What am I doing wrong here, and how can I improve this query's performance to sub-second (< 0.5 s)?
First things first, MaxItemCount doesn't mean that you will get the top 100 documents.
It means that each call to ExecuteNextAsync will return up to 100 documents at a time, but across the iterations you eventually get everything that matches the query.
If you want to limit your results to the top 100, then in LINQ use the .Take(100) method before you call AsDocumentQuery, or in SQL use the TOP keyword.
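The same distinction applies across SDKs. As a minimal illustration, here is a sketch using the newer Azure Cosmos DB Java SDK (v4) rather than the .NET SDK from the question, with hypothetical account, database, and container names: TOP caps the total result set, while the preferred page size (the counterpart of MaxItemCount) only caps each page.
CosmosClient client = new CosmosClientBuilder()
        .endpoint("https://<account>.documents.azure.com:443/") // hypothetical
        .key("<key>")                                            // hypothetical
        .buildClient();
CosmosContainer container = client.getDatabase("mydb").getContainer("mycoll");

// TOP 100 limits the total number of documents the query can ever return.
String sql = "SELECT TOP 100 * FROM p";
CosmosPagedIterable<JsonNode> results =
        container.queryItems(sql, new CosmosQueryRequestOptions(), JsonNode.class);

// The preferred page size only limits how many documents come back per round trip.
int total = 0;
for (FeedResponse<JsonNode> page : results.iterableByPage(20)) {
    total += page.getResults().size(); // at most 20 per page
}
System.out.println("Total retrieved: " + total); // never more than 100 because of TOP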
In terms of performance, this query is expensive for three reasons:
You are checking for records between a range of dates.
You are using the CONTAINS/STARTSWITH functions.
You are joining.
At this point, if changing the schema isn't an option, I would recommend reading more about Indexing and optimising it based on the querying requirements of your application.

DocumentDB Change Feed and saving Checkpoint

After reading the documentation, I'm having a hard time conceptualizing the change feed. Let's take the code from the documentation below. The second change feed is picking up the changes from the last time it was run via the checkpoints. Let's say it is being used to create summary data and there was an issue and it needed to be re-run from a prior time. I don't understand the following:
How do I specify a particular time the checkpoint should start from? I understand I can save the checkpoint dictionary and use that for each run, but how do you get the changes from time X in order to rerun some summary data?
Secondly, let's say we are rerunning some summary data and we save the last checkpoint used for each piece of summarized data so we know where that one left off. How does one know that a record is in or before that checkpoint?
Code that runs from collection beginning and then from last checkpoint:
Dictionary<string, string> checkpoints = await GetChanges(client, collection, new Dictionary<string, string>());

await client.CreateDocumentAsync(collection, new DeviceReading {
    DeviceId = "xsensr-201", MetricType = "Temperature", Unit = "Celsius", MetricValue = 1000
});
await client.CreateDocumentAsync(collection, new DeviceReading {
    DeviceId = "xsensr-212", MetricType = "Pressure", Unit = "psi", MetricValue = 1000
});

// Returns only the two documents created above.
checkpoints = await GetChanges(client, collection, checkpoints);

private async Task<Dictionary<string, string>> GetChanges(
    DocumentClient client,
    string collection,
    Dictionary<string, string> checkpoints)
{
    List<PartitionKeyRange> partitionKeyRanges = new List<PartitionKeyRange>();
    FeedResponse<PartitionKeyRange> pkRangesResponse;
    do
    {
        pkRangesResponse = await client.ReadPartitionKeyRangeFeedAsync(collection);
        partitionKeyRanges.AddRange(pkRangesResponse);
    }
    while (pkRangesResponse.ResponseContinuation != null);

    foreach (PartitionKeyRange pkRange in partitionKeyRanges)
    {
        string continuation = null;
        checkpoints.TryGetValue(pkRange.Id, out continuation);

        IDocumentQuery<Document> query = client.CreateDocumentChangeFeedQuery(
            collection,
            new ChangeFeedOptions
            {
                PartitionKeyRangeId = pkRange.Id,
                StartFromBeginning = true,
                RequestContinuation = continuation,
                MaxItemCount = 1
            });

        while (query.HasMoreResults)
        {
            FeedResponse<DeviceReading> readChangesResponse = query.ExecuteNextAsync<DeviceReading>().Result;
            foreach (DeviceReading changedDocument in readChangesResponse)
            {
                Console.WriteLine(changedDocument.Id);
            }
            checkpoints[pkRange.Id] = readChangesResponse.ResponseContinuation;
        }
    }
    return checkpoints;
}
DocumentDB supports check-pointing only by the logical timestamp returned by the server. If you would like to retrieve all changes from X minutes ago, you would have to "remember" the logical timestamp corresponding to the clock time (ETag returned for the collection in the REST API, ResponseContinuation in the SDK), then use that to retrieve changes.
Change feed uses logical time in place of clock time because it can be different across various servers/partitions. If you would like to see change feed support based on clock time (with some caveats on skew), please propose/upvote at https://feedback.azure.com/forums/263030-documentdb/.
To save the last checkpoint per partition key/document, you can just save the corresponding version of the batch in which it was last seen (ETag returned for the collection in the REST API, ResponseContinuation in the SDK), like Fred suggested in his answer.
How to specify a particular time the checkpoint should start.
You could try to provide a logical version/ETag (such as 95488) instead of a null value as the RequestContinuation property of ChangeFeedOptions.

SQLite storage API Insert statement freezes entire firefox in bootstrapped(Restartless) AddOn

The data to be inserted has just two TEXT columns whose individual lengths don't even exceed 256 characters.
I initially used executeSimpleSQL since I didn't need to get any results.
It worked smoothly for simultaneous inserts of up to 20K records, i.e. no lag or freezing was observed in the background.
However, with 0.1 million records I could see horrible freezing during insertion.
So, I tried these two approaches:
Insert in chunks of 500 records - this didn't work well, since even for 20K records it showed visible freezing. I didn't even try it with 0.1 million.
So, I decided to go async and used executeAsync along with bound parameters etc. This also shows visible freezing for just 20K records. This was the whole array being inserted, not in chunks.
var dirs = Cc["@mozilla.org/file/directory_service;1"]
             .getService(Ci.nsIProperties);
var dbFile = dirs.get("ProfD", Ci.nsIFile);
var dbService = Cc["@mozilla.org/storage/service;1"]
                  .getService(Ci.mozIStorageService);
dbFile.append('mydatabase.sqlite');
var connectDB = dbService.openDatabase(dbFile);

let insertStatement = connectDB.createStatement(
    'INSERT INTO my_table (my_col_a, my_col_b) VALUES (:myColumnA, :myColumnB)');

var arraybind = insertStatement.newBindingParamsArray();
for (let i = 0; i < my_data_array.length; i++) {
    let params = arraybind.newBindingParams();
    // Individual elements of the array hold CSV
    let my_data_arrayTC = my_data_array[i].split(',');
    params.bindByName("myColumnA", my_data_arrayTC[0]);
    params.bindByName("myColumnB", my_data_arrayTC[1]);
    arraybind.addParams(params);
}
insertStatement.bindParameters(arraybind);
insertStatement.executeAsync({
    handleResult: function(aResult) {
        console.log('Results are out');
    },
    handleError: function(aError) {
        console.log("Error: " + aError.message);
    },
    handleCompletion: function(aReason) {
        if (aReason != Components.interfaces.mozIStorageStatementCallback.REASON_FINISHED)
            console.log("Query canceled or aborted!");
        console.log('We are done inserting');
    }
});
connectDB.asyncClose(function() {
    console.log('[INFO][Write Database] Async - plus domain data');
});
Also, I seem to get the async callbacks after a long time. Usually, executeSimpleSQL is way faster than this. If I use the SQLite Manager Tool extension to open the DB immediately, this is what I get (as expected):
SQLiteManager: Error in opening file mydatabase.sqlite - either the file is encrypted or corrupt
Exception Name: NS_ERROR_STORAGE_BUSY
Exception Message: Component returned failure code: 0x80630001 (NS_ERROR_STORAGE_BUSY) [mozIStorageService.openUnsharedDatabase]
My primary objective is to dump data as large as 0.1 million+ records and then later perform reads when needed.
