MongoTemplate Limit query issue - count

I am using MongoTemplate to execute my Mongo queries.
I want to know whether count works when a limit is set.
Also, why does the find query search the full collection (per the query) even though a limit is set?
For example, the query I wrote might match 10000 records, but I want only 100 of them, so I set the limit to 100 and then fired the find query. The query still goes on to search all 10000 records.
dataQuery.limit(100);
List<logs> logResultsTemp = mongoTemplate1.find(dataQuery, logs.class);
Are there any limitations in using the limit command?

Limit works fine (at least on Spring Data version 1.2.1, which I use). Perhaps it was a problem with your version?
As for count, there is a specific method to get your collection count, so you don't need to care about the amount of data your system will fetch:
mongoTemplate.count(new Query(), MyCollection.class)
By the way, if you try this directly in your mongodb console: db.myCollection.find().limit(1).count() you will get the actual total of documents in your collection, not only one. And so it is for the mongoTemplate.count method, so:
mongoTemplate.count(new Query().limit(1), MyCollection.class)
will work the same way.
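If the goal is to show only 100 documents but still report how many matched, a limited find can be paired with a separate count on the same criteria. A minimal sketch, reusing the question's logs entity and mongoTemplate1; the criteria shown is a placeholder for whatever dataQuery already filters on:

import java.util.List;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

public class LogQueryExample {

    private final MongoTemplate mongoTemplate1;

    public LogQueryExample(MongoTemplate mongoTemplate1) {
        this.mongoTemplate1 = mongoTemplate1;
    }

    public List<logs> firstHundred() {
        // Placeholder criteria; substitute the real filter used by dataQuery.
        Query dataQuery = new Query(Criteria.where("level").is("ERROR"));

        // Total number of documents matching the criteria (no limit applied yet).
        long total = mongoTemplate1.count(dataQuery, logs.class);

        // Only fetch the first 100 matching documents.
        dataQuery.limit(100);
        List<logs> logResultsTemp = mongoTemplate1.find(dataQuery, logs.class);

        System.out.println("Showing " + logResultsTemp.size() + " of " + total + " matching documents");
        return logResultsTemp;
    }
}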

Related

How can I query a large result set in Kusto Explorer?

I am trying to return more than 1 million records from a Kusto database in Kusto Explorer, but I am getting the error below:
Query result set has exceeded the internal record count limit 500000 (E_QUERY_RESULT_SET_TOO_LARGE; see http://aka.ms/kustoquerylimits)
I think the limit is 5000000. Any ideas on how I can achieve this? Thanks.
set notruncation;
It's strongly recommended that, in this case, some form of limitation
is still put in place.
set truncationmaxsize=YOUR_LIMIT;
set truncationmaxrecords=YOUR_LIMIT;
Reference : https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/querylimits#limit-on-result-set-size-result-truncation
For large data exports, please use the server-side export options described here:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/data-export/
Service-side export (push): The methods above are somewhat limited since the query results have to be streamed through a single network connection between the producer doing the query and the consumer writing its results. For scalable data export, Kusto provides a "push" export model in which the service running the query also writes its results in an optimized manner. This model is exposed through a set of .export control commands, supporting exporting query results to an external table, a SQL table, or an external Blob storage.
An additional option is to use the Kusto Explorer "run query into csv" button; this adds the set notruncation; for you and saves the results directly to disk, so that you can easily open the results in other tools such as Excel.
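If the same query has to be run programmatically rather than from Kusto Explorer, the set statements can simply be prepended to the query text that the client sends. A rough sketch using the azure-kusto-data Java client; the cluster, credentials, database, table name, and the chosen truncationmaxrecords value are placeholders, and the exact package/class names may differ between SDK versions:

import com.microsoft.azure.kusto.data.Client;
import com.microsoft.azure.kusto.data.ClientFactory;
import com.microsoft.azure.kusto.data.KustoOperationResult;
import com.microsoft.azure.kusto.data.KustoResultSetTable;
import com.microsoft.azure.kusto.data.auth.ConnectionStringBuilder;

public class LargeResultExample {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster URL and AAD application credentials.
        ConnectionStringBuilder csb = ConnectionStringBuilder.createWithAadApplicationCredentials(
                "https://mycluster.kusto.windows.net", "appId", "appKey", "tenantId");
        Client client = ClientFactory.createClient(csb);

        // Lift the default truncation, but still cap the result set explicitly.
        String query = "set notruncation;\n"
                + "set truncationmaxrecords=2000000;\n"
                + "MyLargeTable | where Timestamp > ago(1d)";

        KustoOperationResult result = client.execute("MyDatabase", query);
        KustoResultSetTable primary = result.getPrimaryResults();
        int rows = 0;
        while (primary.next()) {
            rows++;
        }
        System.out.println("Fetched " + rows + " rows");
    }
}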

Google Datastore Python returning fewer entities per page

I am using the Python client SDK for Datastore (google-cloud-datastore), version 1.4.0. I am trying to run a keys-only query fetch:
query = client.query(kind = 'SomeEntity')
query.keys_only()
The query filter has an EQUAL condition on field1 and a GREATER_THAN_OR_EQUAL condition on field2. Ordering is done on field2.
For fetch, I am specifying a limit:
query_iter = query.fetch(start_cursor=cursor, limit=100)
page = next(query_iter.pages)
keyList = [entity.key for entity in page]
nextCursor = query_iter.next_page_token
Though there are around 50 entities satisfying this query, each fetch returns only around 10-15 results and a cursor. I can use the cursor to get all the results, but this adds call overhead.
Is this behavior expected?
A keys_only query is limited to 1000 entries in a single call. This operation counts as a single entity read.
For other Datastore limitations, please refer to the detailed table in the documentation.
However, in your code you specified a cursor as the starting point for a subsequent retrieval operation. The query can also be limited without a cursor:
query = client.query()
query.keys_only()
tasks = list(query.fetch(limit=100))
For detailed instructions on how to use limits and cursors, please refer to the Google Cloud Datastore documentation.

DynamoDBScanExpression withLimit returns more records than Limit

I have to list all the records from a DynamoDB table, without any filter expression.
I want to limit the number of records, hence I am using DynamoDBScanExpression with setLimit.
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
....
// Set ExclusiveStartKey
....
scanExpression.setLimit(10);
However, the scan operation always returns more than 10 results.
Is this the expected behaviour and if so how?
Python Answer
Setting a limit on a scan() does not cap the total number of records you get back; Limit only bounds how many items a single request evaluates, and helpers that paginate automatically keep fetching further pages. With a query you can effectively cap the results.
A query searches through items, the rows in the table. It starts at the top or bottom of an ordered key range and finds items based on set criteria. You must have a partition key (and optionally a sort key) to do this.
A scan, on the other hand, walks the ENTIRE table and, as a result, is NOT ordered.
Since a query works over an ordered key range while a scan walks the whole table, a query is the practical way to get back just the first N items.
To answer the OP's question: essentially it doesn't work as expected because you're using scan, not query.
Here is an example using the client syntax (the lower-level API; I don't have a simpler example that uses the resource API, but you can google that).
def retrieve_latest_item(self):
    # Query (not scan) lets Limit cap what comes back from the ordered key range.
    result = self.dynamodb_client.query(
        TableName="cleaning_company_employees",
        KeyConditionExpression="works_night_shift = :value",
        # Key attributes must be strings, numbers, or binary, so the flag is stored as a string.
        ExpressionAttributeValues={":value": {"S": "true"}},
        ScanIndexForward=False,  # newest first, according to the sort key
        Limit=3,                 # return at most 3 items
    )
    return result
Here are the DynamoDB module docs.
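For the original Java code specifically: DynamoDBMapper.scan() returns a lazily loading PaginatedScanList, so iterating it fetches page after page regardless of setLimit, which only caps the page size. If a single capped page is all that is needed, scanPage() issues exactly one request. A minimal sketch under that assumption; the Employee mapping, key attribute, and table name are illustrative, not taken from the question:

import java.util.List;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBScanExpression;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;
import com.amazonaws.services.dynamodbv2.datamodeling.ScanResultPage;

public class LimitedScanExample {

    // Illustrative mapped class; table and attribute names are assumptions.
    @DynamoDBTable(tableName = "cleaning_company_employees")
    public static class Employee {
        private String id;

        @DynamoDBHashKey(attributeName = "id")
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }

    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        DynamoDBMapper mapper = new DynamoDBMapper(client);

        DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
        scanExpression.setLimit(10); // caps the items evaluated by this single request

        // scanPage() performs one Scan call, so at most 10 items come back;
        // scan() would return a PaginatedScanList that keeps loading further pages.
        ScanResultPage<Employee> page = mapper.scanPage(Employee.class, scanExpression);
        List<Employee> items = page.getResults();
        System.out.println("Fetched " + items.size() + " items");

        // To continue, feed page.getLastEvaluatedKey() into
        // scanExpression.setExclusiveStartKey(...) and call scanPage() again.
    }
}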

Dealing with Indexer timeout when importing documentdb into Azure Search

I'm trying to import a rather large (~200M docs) documentdb into Azure Search, but I'm finding the indexer times out after ~24hrs. When the indexer restarts, it starts again from the beginning, rather than from where it got to, meaning I can't get more than ~40M docs into the search index. The data source has a highwater mark set like this:
var source = new DataSource();
source.Name = DataSourceName;
source.Type = DataSourceType.DocumentDb;
source.Credentials = new DataSourceCredentials(myEnvDef.ConnectionString);
source.Container = new DataContainer(myEnvDef.CollectionName, QueryString);
source.DataChangeDetectionPolicy = new HighWaterMarkChangeDetectionPolicy("_ts");
serviceClient.DataSources.Create(source);
The highwater mark appears to work correctly when testing on a small db.
Should the highwater mark be respected when the indexer fails like this, and if not how can I index such a large data set?
The reason the indexer is not making incremental progress even while timing out after 24 hours (the 24 hour execution time limit is expected) is that a user-specified query (QueryString argument passed to the DataContainer constructor) is used. With a user-specified query, we cannot guarantee and therefore cannot assume that the query response stream of documents will be ordered by the _ts column, which is a necessary assumption to support incremental progress.
So, if a custom query isn't required for your scenario, consider not using it.
Alternatively, consider partitioning your data and creating multiple datasource / indexer pairs that all write into the same index. You can use the Datasource.Container.Query parameter to provide a DocumentDB query that partitions your data using a WHERE filter. That way, each of the indexers will have less work to do, and with sufficient partitioning, will fit under the 24 hour limit. Moreover, if your search service has multiple search units, multiple indexers will run in parallel, further increasing the indexing throughput and decreasing the overall time to index your entire dataset.

Does firebase-util exist in Java?

I found firebase-util and it is great.
Does firebase-util exist for Java? Or is it possible to use "join" in Java?
I was testing firebase-util and found that it is a little slow. Is it better suited to 1:1 joins than to joining 10000 rows with 100 rows (where it might be better to load the 10000 first and then join only if needed)?
Thanks for any reply.
There is not currently a version of firebase-util for Java. Since this is still an experimental lib, the Firebase team is still soliciting feedback and determining the appropriate API. At some point in the near future, there will also be querying features rolled into the core API, which will be considerably more efficient and complete than this client-side helper lib.
It shouldn't matter whether you join 1:1 or 1:many, but 10,000 rows is a very large number for a join utility on the client. You wouldn't be able to display that many in the DOM at once anyway, as that would be even slower. A better solution would be to create an index and do an intersection against that, only fetching a small subset of the records:
// only get the first twenty from path A
var refA = new Firebase('URL/path/A').limit(20);
var refB = new Firebase('URL/path/B');
// intersection only loads if key exists in both paths,
// so we've filtered B by the subset of A
var joinedRef = new Firebase.util.intersection(refA, refB);
This would only fetch records in refB that exist in refA, and thus only the first 20. You can also create an index of specific record ids to fetch, or query for a subset based on priorities, and then use intersection to reduce the payload.
