Retrieving a huge amount of data in Web API - ASP.NET

I have a Web API that lets users retrieve data in real time and in batches. The problem is that one table has over 30 million records, which creates a bottleneck for batch requests. My Get method uses paging and returns 10 records per API request by default, but even those 10 records take a long time to retrieve because of the size of the underlying data set.
Here is a sample of my Get method:
public async Task<IHttpActionResult> Get(int pageno = 1, int pagesize = 10)
{
    int skip = (pageno - 1) * pagesize;
    int total = db.webapi_customer_charges.Count();
    var cc = await db.webapi_customer_charges
        .OrderBy(c => c.hospital_number)
        .Skip(skip)
        .Take(pagesize)
        .ToListAsync();
    return Ok(new Paging<webapi_customer_charges>(cc, pageno, pagesize, total));
}
Is there a way, a workaround, or a best practice for retrieving this much data? Thank you.

I'm not sure how you can do it with EF, since I assume it retrieves the whole set before you can skip or take, but I think it would be faster to call a stored procedure on the SQL Server side and pass in just the minimum and maximum row numbers. That way each call fetches only the rows you need from the server.
Here is a link to how you can call a stored proc with EF. If you don't find a better solution, give it a try; it should be quite simple to write a SELECT that uses ROW_NUMBER() in SQL to filter based on your input parameters.
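The stored-procedure idea boils down to turning the page number into a first and last row number and filtering on ROW_NUMBER(). Below is a minimal sketch of that, assuming SQL Server and the table and column names from the question. The class name RowRangePaging and the @minRow/@maxRow parameter names are made up for illustration, and the query text is only built and printed here, not executed.

```csharp
using System;

public static class RowRangePaging
{
    // Query text a stored procedure (or a raw SQL call) could run.
    // Table and ordering column come from the question;
    // @minRow and @maxRow are hypothetical parameter names.
    public const string PageQuery = @"
SELECT *
FROM (
    SELECT c.*, ROW_NUMBER() OVER (ORDER BY c.hospital_number) AS row_num
    FROM webapi_customer_charges AS c
) AS numbered
WHERE numbered.row_num BETWEEN @minRow AND @maxRow
ORDER BY numbered.row_num;";

    // Converts a 1-based page number and a page size into inclusive row bounds.
    public static (long MinRow, long MaxRow) ToRowRange(int pageNo, int pageSize)
    {
        if (pageNo < 1) throw new ArgumentOutOfRangeException(nameof(pageNo));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));

        // long arithmetic avoids int overflow on very deep pages
        long minRow = (long)(pageNo - 1) * pageSize + 1;
        long maxRow = minRow + pageSize - 1;
        return (minRow, maxRow);
    }

    public static void Main()
    {
        var (minRow, maxRow) = ToRowRange(pageNo: 3, pageSize: 10);
        Console.WriteLine($"@minRow = {minRow}, @maxRow = {maxRow}"); // @minRow = 21, @maxRow = 30
        Console.WriteLine(PageQuery);
    }
}
```

It is worth measuring this against the original LINQ query, because EF normally translates Skip and Take on an IQueryable into SQL paging rather than loading the whole table. The separate Count() over 30 million rows on every request, or a missing index on hospital_number, may turn out to be the larger cost.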

Related

DynamoDB: how to get the item count for a partition key using .NET Core?

How can I get the item count for a particular partition key using .NET Core, preferably with the Object Persistence Interface or the Document Interface?
Since I cannot find any documentation on this, I currently get the count by retrieving all the items and counting them, but those reads are very expensive.
What is the best practice for this kind of item count request? Thank you.
DynamoDB is mostly a document-oriented key-value database, so it is not optimized for common relational database functions such as counting items.
To minimize the data that is transmitted and to improve speed, you may want to do the following:
Create a Lambda function that returns the item count:
This avoids transmitting the data outside of AWS, which is slow and expensive.
Query options:
Use only keys in your projection expression, which reduces the data transmitted from the database.
Use the maximum page size, which reduces the number of calls needed.
Stream option:
Streams could also be used for keeping counts, e.g. as described in
https://medium.com/signiant-engineering/real-time-aggregation-with-dynamodb-streams-f93547cfb244
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-aggregation.html
Related SO question:
Complexity of finding total records count with partition key in nosql dynamodb table?
I just realized that with the low-level interface you can set Select = "COUNT" on a QueryRequest; calling QueryAsync() or Query() then returns only the count, as an integer, instead of the items. Please refer to the code sample below.
private static QueryRequest getStockRecordCountQueryRequest(string tickerSymbol, string prefix)
{
    string partitionName = ":v_PartitionKeyName";
    string sortKeyPrefix = ":v_sortKeyPrefix";
    var request = new QueryRequest
    {
        TableName = Constants.TableName,
        ReturnConsumedCapacity = ReturnConsumedCapacity.TOTAL,
        Select = "COUNT",
        KeyConditionExpression = $"{Constants.PartitionKeyName} = {partitionName} and begins_with({Constants.SortKeyName},{sortKeyPrefix})",
        ExpressionAttributeValues = new Dictionary<string, AttributeValue>
        {
            { $"{partitionName}", new AttributeValue {
                S = tickerSymbol
            }},
            { $"{sortKeyPrefix}", new AttributeValue {
                S = prefix
            }}
        },
        // Optional parameter.
        ConsistentRead = false,
        ExclusiveStartKey = null,
    };
    return request;
}
However, I would like to point out that this still consumes the same read units as retrieving all the items and counting them yourself. But since it returns only the count as an integer, it is a lot more efficient than transmitting the entire item list across the wire.
I think using DynamoDB Streams is a more proper way to get the counts for a large project. It is just a lot more complicated to implement.
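One caveat about the Select = "COUNT" request above: a single Query (or Scan) call reads at most 1 MB of data, and the Count in the response covers only that page. For a large partition, the per-page counts have to be summed while following LastEvaluatedKey. The sketch below shows only that loop. To stay runnable without the AWS SDK, the page fetch is a plain delegate; CountPage, PagedCounter, and the fake three-page data are invented for illustration. With the real client, the delegate would call QueryAsync with ExclusiveStartKey set to the previous response's LastEvaluatedKey.

```csharp
using System;
using System.Collections.Generic;

// One page of a COUNT request: the page's count plus the key to resume from
// (null when there are no more pages).
public sealed class CountPage
{
    public int Count { get; }
    public string LastEvaluatedKey { get; }

    public CountPage(int count, string lastEvaluatedKey)
    {
        Count = count;
        LastEvaluatedKey = lastEvaluatedKey;
    }
}

public static class PagedCounter
{
    // Sums the count of every page, passing each page's LastEvaluatedKey
    // back in as the start key of the next request.
    public static long CountAll(Func<string, CountPage> fetchPage)
    {
        long total = 0;
        string startKey = null;
        do
        {
            CountPage page = fetchPage(startKey);
            total += page.Count;
            startKey = page.LastEvaluatedKey;
        } while (startKey != null);
        return total;
    }

    // Stand-in for DynamoDB: three pages holding 4000, 3500 and 120 matching items.
    public static CountPage FakeFetch(string startKey)
    {
        var pages = new Dictionary<string, CountPage>
        {
            ["start"] = new CountPage(4000, "k1"),
            ["k1"] = new CountPage(3500, "k2"),
            ["k2"] = new CountPage(120, null),
        };
        return pages[startKey ?? "start"];
    }

    public static void Main()
    {
        Console.WriteLine(CountAll(FakeFetch)); // 7620
    }
}
```

Reading only the first page would report 4000 here, which is the kind of undercount to expect when the loop is missing.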

How to get a live table row count in DynamoDB programmatically in ASP.NET Core?

I wrote something like the code below in my ASP.NET Web API. I want to get the live row count to display in my application. The problem with this code is that it shows a scanned count of 7134, but the actual value is in the millions.
var cancellationToken = new CancellationToken();
AmazonDynamoDBClient client = new AmazonDynamoDBClient();
var request = new ScanRequest
{
    TableName = "exampleTable",
    Select = Select.COUNT
};
var response = client.ScanAsync(request, cancellationToken).Result;
var totalCount = response.Count.ToString();
return totalCount;
What do you mean by 'live' row count?
Also, if only a few pieces of code insert into your DynamoDB table, you could add a record specifically for maintaining the count and update that record's counter whenever you insert into or delete from the table. It's more work, but it may be better than incurring the cost of scanning millions of records multiple times a day.
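The counter-record idea can be sketched without DynamoDB as follows. This is an in-memory stand-in, not SDK code: CountedTable, Put, and Delete are hypothetical names. A real implementation would more likely use an atomic update on the counter item (for example an UpdateItem with an ADD expression). The point illustrated is that the counter should change only when an item is actually created or removed, not when an existing key is overwritten.

```csharp
using System;
using System.Collections.Generic;

public sealed class CountedTable
{
    private readonly Dictionary<string, string> items = new Dictionary<string, string>();
    private readonly object gate = new object();
    private long itemCount; // plays the role of the dedicated counter record

    public long ItemCount
    {
        get { lock (gate) { return itemCount; } }
    }

    // Inserts or overwrites an item; only a brand-new key bumps the counter.
    public void Put(string key, string value)
    {
        lock (gate)
        {
            if (!items.ContainsKey(key)) itemCount++;
            items[key] = value;
        }
    }

    // Removes an item; the counter drops only if the key existed.
    public bool Delete(string key)
    {
        lock (gate)
        {
            bool removed = items.Remove(key);
            if (removed) itemCount--;
            return removed;
        }
    }
}

public static class CounterDemo
{
    public static void Main()
    {
        var table = new CountedTable();
        table.Put("a", "1");
        table.Put("b", "2");
        table.Put("a", "3");     // overwrite, count unchanged
        table.Delete("b");
        table.Delete("missing"); // no such key, count unchanged
        Console.WriteLine(table.ItemCount); // 1
    }
}
```

Reading the count then becomes a single-item lookup instead of a scan, at the price of one extra write per insert or delete.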

Xamarin.Forms Azure Mobile Apps slow sync

I'm using Azure Mobile App with Xamarin.Forms to create an offline capable mobile app.
My solution is based on https://adrianhall.github.io/develop-mobile-apps-with-csharp-and-azure/chapter3/client/
Here is the code that I use for offline sync:
public class AzureDataSource
{
    private async Task InitializeAsync()
    {
        // Short circuit - local database is already initialized
        if (client.SyncContext.IsInitialized)
        {
            return;
        }
        // Define the database schema
        store.DefineTable<ArrayElement>();
        store.DefineTable<InputAnswer>();
        //Same thing with 16 others table
        ...
        // Actually create the store and update the schema
        await client.SyncContext.InitializeAsync(store, new MobileServiceSyncHandler());
    }

    public async Task SyncOfflineCacheAsync()
    {
        await InitializeAsync();
        //Check if authenticated
        if (client.CurrentUser != null)
        {
            // Push the Operations Queue to the mobile backend
            await client.SyncContext.PushAsync();
            // Pull each sync table
            var arrayTable = await GetTableAsync<ArrayElement>();
            await arrayTable.PullAsync();
            var inputAnswerInstanceTable = await GetTableAsync<InputAnswer>();
            await inputAnswerInstanceTable.PullAsync();
            //Same thing with 16 others table
            ...
        }
    }

    public async Task<IGenericTable<T>> GetTableAsync<T>() where T : TableData
    {
        await InitializeAsync();
        return new AzureCloudTable<T>(client);
    }
}

public class AzureCloudTable<T>
{
    public AzureCloudTable(MobileServiceClient client)
    {
        this.client = client;
        this.table = client.GetSyncTable<T>();
    }

    public async Task PullAsync()
    {
        //Query name used for incremental pull
        string queryName = $"incsync_{typeof(T).Name}";
        await table.PullAsync(queryName, table.CreateQuery());
    }
}
The problem is that syncing takes a long time even when there is nothing to pull (8-9 seconds on Android devices, and more than 25 seconds to pull the whole database).
I used Fiddler to check how long the Mobile Apps backend takes to respond, and it is about 50 milliseconds per request, so the problem doesn't seem to come from there.
Does anyone have the same trouble? Is there something I'm doing wrong, or are there any tips to improve my sync performance?
Our particular issue was linked to our database migration. Every row in the database had the same updatedAt value. We ran an SQL script to modify these so that they were all unique.
This fix was actually for some other issue we had, where not all rows were being returned for some unknown reason, but we also saw a substantial speed improvement.
Also, another weird fix that improved loading times was the following.
After we had pulled all of the data the first time (which, understandably takes some time) - we did an UpdateAsync() on one of the rows that were returned, and we did not push it afterwards.
We've come to understand that offline sync works by pulling anything with a date newer than the most recent updatedAt value it has stored locally. There was a small speed improvement associated with this.
Finally, the last thing we did to improve speed was to not fetch the data again, if it already had cached a copy in the view. This may not work for your use case though.
public List<Foo> fooList = new List<Foo>();

public async Task DisplayAllFoo()
{
    // Only fetch when nothing is cached in the view yet
    if (fooList.Count == 0)
        fooList = await SyncClass.GetAllFoo();
    foreach (var foo in fooList)
    {
        Console.WriteLine(foo.bar);
    }
}
Edit 20th March 2019:
With these improvements in place, we are still seeing very slow sync operations, used in the same way as mentioned in the OP, also including the improvements listed in my answer here.
I encourage all to share their solutions or ideas on how this speed can be improved.
One of the reasons for a slow Pull() is having more than 10 rows with the same UpdatedAt value. This happens when you update many rows at once, for example by running an SQL command.
One way to overcome this is to modify the default trigger on the tables. To ensure every row gets a unique UpdatedAt, we did something like this:
ALTER TRIGGER [dbo].[TR_dbo_Items_InsertUpdateDelete] ON [dbo].[TableName]
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    DECLARE @InsertedAndDeleted TABLE
    (
        Id NVARCHAR(128)
    );
    DECLARE @Count INT,
            @Id NVARCHAR(128);
    INSERT INTO @InsertedAndDeleted
    SELECT Id
    FROM inserted;
    INSERT INTO @InsertedAndDeleted
    SELECT Id
    FROM deleted
    WHERE Id NOT IN
    (
        SELECT Id
        FROM @InsertedAndDeleted
    );
    --select * from @InsertedAndDeleted;
    SELECT @Count = Count(*)
    FROM @InsertedAndDeleted;
    -- ************************ UpdatedAt ************************
    -- while loop
    WHILE @Count > 0
    BEGIN
        -- selecting
        SELECT TOP (1) @Id = Id
        FROM @InsertedAndDeleted;
        -- updating
        UPDATE [dbo].[TableName]
        SET UpdatedAt = Convert(DATETIMEOFFSET, DateAdd(MILLISECOND, @Count, SysUtcDateTime()))
        WHERE Id = @Id;
        -- deleting
        DELETE FROM @InsertedAndDeleted
        WHERE id = @Id;
        -- counter
        SET @Count = @Count - 1;
    END;
END;

DynamoDB Mapper Query Doesn't Respect QueryExpression Limit

Imagine the following function which is querying a GlobalSecondaryIndex and associated Range Key in order to find a limited number of results:
@Override
public List<Statement> getAllStatementsOlderThan(String userId, String startingDate, int limit) {
    if(StringUtils.isNullOrEmpty(startingDate)) {
        startingDate = UTC.now().toString();
    }
    LOG.info("Attempting to find all Statements older than ({})", startingDate);
    Map<String, AttributeValue> eav = Maps.newHashMap();
    eav.put(":userId", new AttributeValue().withS(userId));
    eav.put(":receivedDate", new AttributeValue().withS(startingDate));
    DynamoDBQueryExpression<Statement> queryExpression = new DynamoDBQueryExpression<Statement>()
        .withKeyConditionExpression("userId = :userId and receivedDate < :receivedDate").withExpressionAttributeValues(eav)
        .withIndexName("userId-index")
        .withConsistentRead(false);
    if(limit > 0) {
        queryExpression.setLimit(limit);
    }
    List<Statement> statementResults = mapper.query(Statement.class, queryExpression);
    LOG.info("Successfully retrieved ({}) values", statementResults.size());
    return statementResults;
}

List<Statement> results = statementRepository.getAllStatementsOlderThan(userId, UTC.now().toString(), 5);
assertThat(results.size()).isEqualTo(5); // NEVER passes
The limit isn't respected whenever I query against the database. I always get back all results that match my search criteria, so if I set the startingDate to now, I get every item in the database, since they're all older than now.
You should use the queryPage function instead of query.
From DynamoDBQueryExpression.setLimit documentation:
Sets the maximum number of items to retrieve in each service request
to DynamoDB.
Note that when calling DynamoDBMapper.query, multiple
requests are made to DynamoDB if needed to retrieve the entire result
set. Setting this will limit the number of items retrieved by each
request, NOT the total number of results that will be retrieved. Use
DynamoDBMapper.queryPage to retrieve a single page of items from
DynamoDB.
As they've rightly answered, the setLimit and withLimit functions limit the number of records fetched only in each individual request; internally, multiple requests take place to fetch the full result set.
If you want to limit the number of records fetched across all the requests, then you might want to use "Scan".
An example of this can be found here

What is returned from a "await db.Database.ExecuteSqlCommandAsync(sql, parameters)"

I am trying to use this method to call a stored procedure.
var abc = await db.Database.ExecuteSqlCommandAsync(sql, parameters)
I see plenty of information on how to return data using parameters and that works okay. But I cannot find anything that tells me what is returned from the call?
Can someone tell me what will be put into abc and how can it be used?
As per
MSDN Database.ExecuteSqlCommandAsync Method (String, Object[])
it returns Task<int>. Here it is shown:
public Task<int> ExecuteSqlCommandAsync(
    string sql,
    params Object[] parameters
)
Return Value
Type: System.Threading.Tasks.Task<Int32>
A task that represents the asynchronous operation. The task result contains the result returned by the database after executing the command.
What it contains depends on what you have in your SQL query (the sql parameter in your case). For example, if you have a simple DELETE query like:
DELETE student WHERE papertype = 'science'
it will return the number of rows affected by the command. It's not a production-level query, so please ignore its quality, but you get the idea!

Resources