Dealing with Indexer timeout when importing documentdb into Azure Search - azure-cosmosdb

I'm trying to import a rather large (~200M docs) DocumentDB collection into Azure Search, but I'm finding that the indexer times out after ~24 hrs. When the indexer restarts, it starts again from the beginning rather than from where it got to, meaning I can't get more than ~40M docs into the search index. The data source has a high-water mark set like this:
var source = new DataSource();
source.Name = DataSourceName;
source.Type = DataSourceType.DocumentDb;
source.Credentials = new DataSourceCredentials(myEnvDef.ConnectionString);
source.Container = new DataContainer(myEnvDef.CollectionName, QueryString);
source.DataChangeDetectionPolicy = new HighWaterMarkChangeDetectionPolicy("_ts");
serviceClient.DataSources.Create(source);
The high-water mark appears to work correctly when testing on a small DB.
Should the high-water mark be respected when the indexer fails like this, and if not, how can I index such a large data set?

The reason the indexer is not making incremental progress while timing out after 24 hours (the 24-hour execution time limit itself is expected) is that a user-specified query (the QueryString argument passed to the DataContainer constructor) is used. With a user-specified query, we cannot guarantee, and therefore cannot assume, that the query response will stream documents ordered by the _ts column, which is a necessary assumption for supporting incremental progress.
So, if a custom query isn't required for your scenario, consider not using it.
Alternatively, consider partitioning your data and creating multiple datasource / indexer pairs that all write into the same index. You can use the Datasource.Container.Query parameter to provide a DocumentDB query that partitions your data using a WHERE filter. That way, each of the indexers will have less work to do, and with sufficient partitioning, will fit under the 24-hour limit. Moreover, if your search service has multiple search units, multiple indexers will run in parallel, further increasing the indexing throughput and decreasing the overall time to index your entire dataset.
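As a rough sketch of that partitioning approach (Python purely for illustration; the partitionId property and the partition count are hypothetical, and each generated query would become the Datasource.Container.Query of its own datasource/indexer pair):

```python
# Sketch: generate one filtered DocumentDB query per partition so that each
# indexer processes a disjoint slice of the collection. The property name
# "partitionId" and the partition count are illustrative assumptions.
def build_partition_queries(num_partitions):
    queries = []
    for p in range(num_partitions):
        # Each query selects only the documents belonging to one partition.
        queries.append("SELECT * FROM c WHERE c.partitionId = {}".format(p))
    return queries

# One datasource/indexer pair would then be created per query,
# all targeting the same search index.
for q in build_partition_queries(8):
    print(q)
```

Any property with evenly distributed values works as the partition filter; the key point is that the partitions are disjoint and together cover the whole collection.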

Related

How to Store common query statements (Kusto.Explorer, KQL, Kusto, Azure Data Explorer, ADX)

Is there a way to store common query statements in Kusto.Explorer for future use? For example:
Most of my queries start with:
set notruncation;
set maxmemoryconsumptionperiterator=68719476736;
set servertimeout = timespan(15m);
I would like to use a 'variable name' to reference these instead of explicitly calling them out every time. Something like this:
Setlimitations
T
| summarize count() by Key
set statements, when used, must be specified as part of each request.
However, you can define a request limits policy on the default workload group (or on a custom one) with the same settings, and those will apply to all requests classified to that workload group.
Also see: https://y0nil.github.io/kusto.blog/blog-posts/workload-groups.html
Do note that always running with notruncation, a very high maxmemoryconsumptionperiterator, and an extended servertimeout probably indicates some inefficiency in your workload; you may want to revisit why these are needed to begin with.
For example, if you're frequently exporting large volumes of data, you may prefer exporting to cloud storage instead of via a query.

Cosmos DB .NET SDK order by a dynamic field (parameterized)

I use the .NET SDK to retrieve some items in a Cosmos DB instance using continuationTokens to be able to retrieve paginated pieces of data. So far this works.
I use a generic Get function to retrieve the items:
var query = container.GetItemQueryIterator<T>(
    new QueryDefinition("SELECT * FROM c"),
    continuationToken: continuationToken,
    requestOptions: new QueryRequestOptions()
    {
        MaxItemCount = itemCount
    });
However, I would like to add a dynamic order-by field where the caller can decide on which field the results should be ordered. I tried adding a parameterized field like:
new QueryDefinition("SELECT * FROM c order by #orderBy")
.WithParameter("#orderBy", "fieldname")
But this does not work, I keep getting Syntax errors while executing, is it actually possible to dynamically add an order by clause?
The .WithParameter() fluent syntax can only substitute values (for example in a WHERE clause), not property names, so you will have to construct your SQL string yourself with the ORDER BY clause appended dynamically.
One thing to keep in mind is that unless this is a small workload with less than 20GB of data, this container will not scale unless you use the partition key in your queries. The other consideration here is that ORDER BY gets much better performance when you use composite indexes. But if there is a large number of properties that results can be sorted on, writes may get very expensive from all of the individual composite indexes.
In all cases, if this is meant to scale you should measure and benchmark high concurrency operations.
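Since the field name has to be spliced into the SQL text rather than passed as a parameter, it is worth guarding the concatenation with an allow-list. A minimal sketch in Python (the field names and query shape are illustrative assumptions; in the .NET SDK you would build the same string before passing it to QueryDefinition):

```python
# Sketch: append ORDER BY safely by only accepting known property names,
# which prevents callers from injecting arbitrary SQL into the query text.
ALLOWED_ORDER_FIELDS = {"name", "createdAt", "price"}  # illustrative allow-list

def build_query(order_by=None, descending=False):
    sql = "SELECT * FROM c"
    if order_by is not None:
        if order_by not in ALLOWED_ORDER_FIELDS:
            raise ValueError("unsupported order-by field: " + order_by)
        sql += " ORDER BY c." + order_by + (" DESC" if descending else " ASC")
    return sql

print(build_query("price", descending=True))
# An unknown field name raises instead of producing an injectable string.
```

The same allow-list check ports directly to C# before the string reaches QueryDefinition.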

Google Datastore Python client returning fewer entities per page

I am using Python client SDK for Datastore (google-cloud-datastore) version 1.4.0. I am trying to run a key-only query fetch:
query = client.query(kind = 'SomeEntity')
query.keys_only()
The query filter has an EQUAL condition on field1 and a GREATER_THAN_OR_EQUAL condition on field2; ordering is based on field2.
For fetch, I am specifying a limit:
query_iter = query.fetch(start_cursor=cursor, limit=100)
page = next(query_iter.pages)
keyList = [entity.key for entity in page]
nextCursor = query_iter.next_page_token
Though there are around 50 entities satisfying this query, each fetch returns only around 10-15 results plus a cursor. I can use the cursor to get all the results, but this incurs additional call overhead.
Is this behavior expected?
A keys-only query is limited to 1000 entries in a single call, and the operation counts as a single entity read.
For other limitations of Datastore, please refer to the detailed limits table in the documentation.
Also note that Datastore may return fewer results per call than the specified limit, so receiving 10-15 entities plus a cursor is expected; keep following the cursor until it is exhausted. In your code you passed a cursor as the starting point for a subsequent retrieval; the query can also be limited without a cursor:
query = client.query(kind='SomeEntity')
query.keys_only()
tasks = list(query.fetch(limit=100))
For detailed instructions on using limits and cursors, please refer to the Google Cloud Datastore documentation.
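The follow-the-cursor loop can be sketched in pure Python with a stubbed page fetcher (the stub stands in for query.fetch; the short 15-item pages mimic the behavior described in the question, and the fetch_page signature is an assumption for illustration, not the real client API):

```python
# Sketch: accumulate results across pages by following cursors until either
# the desired total is reached or the cursor is exhausted. fetch_page is a
# stand-in for a Datastore page fetch: it takes a cursor and returns
# (items, next_cursor), where next_cursor is None when no results remain.
def fetch_all(fetch_page, total_limit):
    items, cursor = [], None
    while len(items) < total_limit:
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if cursor is None:  # no more pages
            break
    return items[:total_limit]

# Simulated backend that returns short pages of 15 items, like the
# behavior described in the question.
def make_fake_backend(total):
    data = list(range(total))
    def fetch_page(cursor):
        start = cursor or 0
        page = data[start:start + 15]
        nxt = start + 15 if start + 15 < total else None
        return page, nxt
    return fetch_page

print(len(fetch_all(make_fake_backend(50), 100)))  # prints 50: every available item
```

Each loop iteration costs one RPC, so the overhead is proportional to the number of short pages the backend chooses to return.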

DynamoDBScanExpression withLimit returns more records than Limit

I have to list all the records from a DynamoDB table, without any filter expression.
I want to limit the number of records, hence I am using DynamoDBScanExpression with setLimit.
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
....
// Set ExclusiveStartKey
....
scanExpression.setLimit(10);
However, the scan operation always returns more than 10 results!
Is this the expected behaviour and if so how?
Python Answer
In DynamoDB, both query and scan accept a Limit, but it bounds the number of items evaluated per request (i.e., per page), not the total number of results. The Java DynamoDBMapper's scan() returns a lazily loaded, paginated list that transparently fetches additional pages as you iterate, which is why you keep seeing more than 10 records; to get a single page that honors the limit, use scanPage() instead.
More generally, a query searches within one partition by key: it starts at the top or bottom of the sorted items and finds matches based on the key condition, so you need a partition key (and optionally a sort key) to use it. A scan, on the other hand, walks the ENTIRE table, and its results are not ordered.
Because query results are ordered by the sort key, a query with a Limit is the natural way to fetch just the first N items.
Here is an example of a limited query using the low-level client syntax (the resource interface offers a simpler equivalent):
def retrieve_latest_item(self):
    # Return the 3 newest items in one partition (sort key descending).
    result = self.dynamodb_client.query(
        TableName="cleaning_company_employees",
        # Key attributes must be strings, numbers, or binary, so the
        # partition key here is a string rather than a boolean.
        KeyConditionExpression="shift = :value",
        ExpressionAttributeValues={":value": {"S": "night"}},
        ScanIndexForward=False,  # newest first
        Limit=3,
    )
    return result
Here are the DynamoDB module docs.
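To see why a paginated scan yields more items than the limit, here is a dependency-free Python simulation (no AWS calls; page_size plays the role of Limit, and the generator mimics a lazily paginated scan):

```python
# Sketch: Limit bounds each page, not the total. A lazily paginated scan
# (like DynamoDBMapper's PaginatedScanList) keeps fetching pages as you
# iterate, so consuming the whole iterator yields every item in the table.
def scan_pages(table, page_size):
    start = 0
    while start < len(table):
        # One "request" returning up to page_size (Limit) items.
        yield table[start:start + page_size]
        start += page_size

table = list(range(25))  # simulated table contents

# Iterating every page returns all 25 items despite page_size=10 ...
all_items = [item for page in scan_pages(table, 10) for item in page]

# ... while taking only the first page honors the limit, like scanPage().
first_page = next(scan_pages(table, 10))
print(len(all_items), len(first_page))  # 25 10
```

So the Java behavior in the question is expected: the iterator silently issues follow-up requests; stopping after one page is what enforces the cap.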

Does firebase-util exist for Java?

I found firebase-util and it is great.
Does firebase-util exist for Java? Or is it possible to use "join" in Java?
While testing firebase-util, I found it a little slow. Is it better suited to 1:1 joins than to joining 10,000 rows with 100 rows (where it would be better to load the 10,000 and then join only when needed)?
Thanks for any reply.
There is not currently a version of Firebase-util for Java. Since this is still an experimental lib, the Firebase team is still soliciting feedback and determining the appropriate API. At some point in the near future, querying features will also be rolled into the core API, which will be considerably more efficient and complete than this client-side helper lib.
It shouldn't matter whether you join 1:1 or 1:many, but 10,000 rows is a very large number for a join utility on the client. You wouldn't be able to display that many in the DOM at once anyway, and doing so would be even slower. A better solution would be to create an index and do an intersection against it, fetching only a small subset of the records:
// only get the first twenty from path A
var refA = new Firebase('URL/path/A').limit(20);
var refB = new Firebase('URL/path/B');
// intersection only loads if key exists in both paths,
// so we've filtered B by the subset of A
var joinedRef = new Firebase.util.intersection(refA, refB);
This would only fetch records in refB that exist in refA, and thus only the first 20. You can also create an index of specific record ids to fetch, or query for a subset based on priorities, and then use intersection to reduce the payload.
