SolrCloud /select returns a different result count than the number of documents processed

Abnormal behavior when running SolrCloud:
Problem: DIH says the number of documents processed is x, but a query always returns fewer than x (generally x-1 or x-2).
solr-9.0.0
openjdk-11.0.2
heap-memory: default
To reproduce the problem:
start SolrCloud with at least 3 instances (I have used 4 instances)
3 ZooKeeper instances
configure DIH from SQL Server.
=> Here the SQL table contains the physical paths of the documents; DIH imports each file path, and a transformer class reads the file and sends it to Solr.
=> The documents are txt files only.
OK, so in a collection:
with 1 shard
replication factor 2 (1 NRT and 1 TLOG)
so in this case 2 instances host a leader replica and 2 host a non-leader replica.
Documents: around 100k (most documents are < 10 KB, and around 100 documents are between 10 and 60 MB).
Start indexing (on any node, using SolrJ or the Admin UI).
Now consider the scenario where, part-way through, multiple nodes crash or are restarted; in a 4-instance Solr cluster, most probably 2.
=> These 2 restarted instances will not be in the same shard, so indexing can continue; neither is the node where indexing was started.
=> So the motive is to bring down the leader replica's node to check fault tolerance.
Here is the problem:
When indexing is completed it returns the status: fetched: X, processed: Y.
Let's say that for 100k documents it reports fetched: 100k, processed: 100k.
=> A /select query then returns 1 document fewer, e.g. total number of found documents: 99,999.
So to cross-check, I collected all the document IDs and compared them with the /select query result set, and found 1 ID that was not indexed (see the sketch below).
=> I've tried it multiple times, and every time a random number of IDs is missing even though DIH says they were processed.
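For the cross-check itself, a minimal SolrJ sketch (ZooKeeper hosts and the collection name are placeholders) that forces a hard commit first, so anything still buffered becomes searchable, and then streams every indexed ID with cursor paging:

import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class MissingIdCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                List.of("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build()) {
            solr.setDefaultCollection("mycollection");

            // Hard commit so updates still sitting in the tlog/update buffer are searchable.
            solr.commit(true, true);

            // Stream all IDs with cursorMark paging (requires a sort on the uniqueKey).
            Set<String> indexed = new HashSet<>();
            SolrQuery q = new SolrQuery("*:*")
                    .setFields("id")
                    .setRows(1000)
                    .setSort("id", SolrQuery.ORDER.asc);
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(q);
                for (SolrDocument d : rsp.getResults()) {
                    indexed.add((String) d.getFieldValue("id"));
                }
                String next = rsp.getNextCursorMark();
                if (next.equals(cursor)) break;
                cursor = next;
            }
            // Diff against the IDs from the source SQL table (loaded elsewhere):
            // sourceIds.removeAll(indexed) leaves exactly the missing documents.
            System.out.println("Indexed IDs: " + indexed.size());
        }
    }
}

If the count already matches right after the explicit commit, the gap was a visibility issue rather than a lost update.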
When the leader replica's node goes down, electing another replica as leader takes a long time.
And no mention is made of that document in the solr.log file.
What I think is happening: if the node goes down while the tlog file is being written, the document cannot be processed and is never written to the tlog at all. Let me know if I'm missing something.
All configuration is the default, except that I added DIH manually to Solr 9.0 (it is no longer bundled).
Running in the bundled Jetty server.

MarkLogic commit frame/return sequence guarantee

I have a simple 1-node MarkLogic server from which I need to purge documents daily.
The test query below selects the documents and then returns a sequence which should do the following for each document:
1. output the name of the file being extracted
2. ensure the directory path of the file in #1 exists
3. save a zipped version of the document to the file in #1
4. delete the document
Is this structure safe? It returns a sequence for each document to be deleted, and the last item in the returned sequence deletes the document. If any of the prior steps fail, will the document still be deleted? Should I trust the engine to execute the returned sequence in the order given?
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";

let $dateLimitAll := current-dateTime() - xs:dayTimeDuration("P1460D")
let $dateLimitSome := current-dateTime() - xs:dayTimeDuration("P730D")
for $adoc in doc()[1 to 5]
let $docDate := $adoc/Unit/created
let $uri := document-uri($adoc)
let $path := fn:concat("d:/purge/", $adoc/Unit/xmldatastore/state/data(), "/",
                       fn:year-from-dateTime($docDate), "/",
                       fn:month-from-dateTime($docDate))
let $filename := fn:concat($path, "/", $uri, ".zip")
where ($docDate < $dateLimitAll)
   or (($docDate < $dateLimitSome)
       and ($adoc/Unit/xmldatastore/state != "FIRMED")
       and ($adoc/Unit/xmldatastore/state != "SHIPPED"))
return (
  $filename,
  xdmp:filesystem-directory-create($path,
    map:new(map:entry("createParents", fn:true()))),
  xdmp:save($filename,
    xdmp:zip-create(<parts xmlns="xdmp:zip"><part>{$uri}</part></parts>, doc($uri))),
  xdmp:document-delete($uri)
)
P.S. Please ignore the [1 to 5] doc limit; it was added for testing.
If any of the prior steps fail, will the document still be deleted?
If there is an error in the execution of that module, the transaction will roll back and the delete from the database will be undone.
However, the directory and zip file written to the filesystem will persist and will not be deleted. The xdmp:filesystem-directory-create() and xdmp:save() functions are not rolled back or undone if a transaction rolls back.
Should I trust the engine to execute the return sequence in order given?
Not sure that it matters much, given the statement above.
Is this structure safe?
It is unclear how many documents you might be dealing with. You may find that the filtering is better/faster using cts:search and some indexes to target the candidate documents. Also, even if you can select the set of documents to process faster, if there are a lot of documents you could still exceed execution time limits.
Another approach might be to break up the work: select the URIs of the documents that match the criteria, and then have a separate query execution for each document that is responsible for saving the zip file and deleting the document from the database (see the sketch below). This is likely to be faster, since you can process multiple documents in parallel; it avoids the risk of a timeout; and in the event of an exception, it allows some items to fail without causing the entire set to fail and roll back.
Tools such as CoRB were built exactly for this type of batch work.
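As an illustration only, a rough Java sketch of that split using the MarkLogic Java Client API's server-side eval (connection details, the simplified path logic, and the required eval privileges are all assumptions here; CoRB gives you this orchestration, plus batching and retry, out of the box):

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.eval.EvalResultIterator;

public class PurgeRunner {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        DatabaseClient client = DatabaseClientFactory.newClient(
                "localhost", 8000,
                new DatabaseClientFactory.DigestAuthContext("admin", "admin"));

        // Step 1: one cheap, read-only query that only collects the matching URIs.
        String selectUris =
            "for $adoc in doc() " +
            "where $adoc/Unit/created < (current-dateTime() - xs:dayTimeDuration('P1460D')) " +
            "return document-uri($adoc)";

        // Step 2: each URI is purged in its own transaction, so one failure rolls
        // back only that document instead of the whole batch. The path building is
        // simplified here relative to the original query.
        String purgeOne =
            "declare variable $uri external; " +
            "( xdmp:save(fn:concat('d:/purge/', $uri, '.zip'), " +
            "    xdmp:zip-create(<parts xmlns='xdmp:zip'><part>{$uri}</part></parts>, doc($uri))), " +
            "  xdmp:document-delete($uri) )";

        try (EvalResultIterator uris = client.newServerEval().xquery(selectUris).eval()) {
            while (uris.hasNext()) {
                String uri = uris.next().getString();
                client.newServerEval().xquery(purgeOne).addVariable("uri", uri).eval();
            }
        }
        client.release();
    }
}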

Cosmos DB to store and retrieve thousands of documents within seconds

I am storing millions of documents in Cosmos DB with a proper partition key. I need to retrieve, say, 500,000 documents to do some calculations and display the output in a UI, and this should happen within, say, 10 seconds.
Would this be possible? I have tried it, but it takes nearly a minute. Is this the correct approach for this kind of requirement?
"id": "Latest_100_Sku1_1496188800",
"PartitionKey": "Latest_100_Sku1
"SnapshotType": 2,
"AccountCode": "100",
"SkuCode": "Sku1",
"Date": "2017-05-31T00:00:00",
"DateEpoch": 1496188800,
"Body": "rVNBa4MwFP4v72xHElxbvYkbo4dBwXaX0UOw6ZRFIyaBFfG/7zlT0EkPrYUcku+9fO/7kvca"
Size of one document: 825 bytes
I am using autoscale with 4000 RU/s throughput.
Query statistics - I am using 2 queries.
Query 1: select * from c where c.id in ({ids})
Here I set the PartitionKey in the query options (sketched below).
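For reference, pinning the partition key on this query might look like the following with the Azure Cosmos DB Java SDK v4 (endpoint, key, database, and container names are placeholders):

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosQueryRequestOptions;
import com.azure.cosmos.models.PartitionKey;

public class Query1Sketch {
    public static void main(String[] args) {
        // Placeholder connection and container details.
        CosmosClient client = new CosmosClientBuilder()
                .endpoint("https://<account>.documents.azure.com:443/")
                .key("<key>")
                .buildClient();
        CosmosContainer container = client.getDatabase("mydb").getContainer("snapshots");

        // Pinning the partition key keeps this a single-partition query.
        CosmosQueryRequestOptions options = new CosmosQueryRequestOptions()
                .setPartitionKey(new PartitionKey("Latest_100_Sku1"));

        container.queryItems(
                "SELECT * FROM c WHERE c.id IN ('Latest_100_Sku1_1496188800')",
                options, Object.class)
            .forEach(System.out::println);
        client.close();
    }
}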
Query Statistics (showing results 1 - 100)

METRIC                                  VALUE
Request Charge                          102.11 RUs
Retrieved document count                200
Retrieved document size                 221672 bytes
Output document count                   200
Output document size                    221972 bytes
Index hit document count                200
Index lookup time                       17.0499 ms
Document load time                      1.59 ms
Query engine execution time             0.3401 ms
System function execution time          0.06 ms
User defined function execution time    0 ms
Document write time                     0.16 ms
Round Trips                             1
Query 2:
select * from c where c.PartitionKey in ({keys}) and c.DateEpoch>={startDate.ToEpoch()} and c.DateEpoch<={endDate.ToEpoch()}
Query Statistics (showing results 1 - 100)

METRIC                                  VALUE
Request Charge                          226.32 RUs
Retrieved document count                200
Retrieved document size                 176580 bytes
Output document count                   200
Output document size                    176880 bytes
Index hit document count                200
Index lookup time                       88.31 ms
Document load time                      4.2399 ms
Query engine execution time             0.4701 ms
System function execution time          0.06 ms
User defined function execution time    0 ms
Document write time                     0.19 ms
Round Trips                             1
Query #1 looks fine. Query #2 would most likely benefit from a composite index on DateEpoch. I'm not sure what the UDF is, but if you're converting dates to epoch values you will want to read the new blog post New date and time system functions in Azure Cosmos DB.
Overall, retrieving 500K documents in 1-2 queries to do some calculations seems like a strange use case. Typically, people pre-calculate values and persist them using a materialized view pattern built on the change feed. Depending on how often you run these two queries, this is often a more efficient use of compute resources.
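As a hedged illustration of the composite index suggestion, using the Java SDK v4 (property paths follow the sample document above; the container handle is assumed to be built as in the earlier sketch, and the same policy can be set as JSON in the portal):

import java.util.List;

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CompositePath;
import com.azure.cosmos.models.CompositePathSortOrder;
import com.azure.cosmos.models.CosmosContainerProperties;
import com.azure.cosmos.models.IndexingPolicy;

public class CompositeIndexSetup {
    // One composite index covering the equality filter (PartitionKey) plus
    // the range filter (DateEpoch) used by query #2.
    static void addCompositeIndex(CosmosContainer container) {
        CompositePath pk = new CompositePath();
        pk.setPath("/PartitionKey");
        pk.setOrder(CompositePathSortOrder.ASCENDING);
        CompositePath epoch = new CompositePath();
        epoch.setPath("/DateEpoch");
        epoch.setOrder(CompositePathSortOrder.ASCENDING);

        CosmosContainerProperties props = container.read().getProperties();
        IndexingPolicy policy = props.getIndexingPolicy();
        policy.setCompositeIndexes(List.of(List.of(pk, epoch)));
        props.setIndexingPolicy(policy);
        container.replace(props);
    }
}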

Cosmos DB OFFSET LIMIT clause issue

I am querying a Cosmos DB using the REST API. I am having problems with the OFFSET LIMIT clause. I have tested this both with my code (Dart) and with Postman, with the same results:
This query works ok:
SELECT * FROM Faults f WHERE CONTAINS(f.Key, 'start', true)
This query does not work. Same as 1, but using OFFSET and LIMIT to get a subset:
SELECT * FROM Faults f
WHERE CONTAINS(f.Key, 'start', true)
OFFSET 10 LIMIT 10
This query works ok. Same as 2, but with an additional filter:
SELECT * FROM Faults f
WHERE CONTAINS(f.Key, 'start', true)
AND f.Node = 'XI'
OFFSET 10 LIMIT 10
I don't get why, if 1 and 3 are working, 2 is not.
This is the response from query 2:
{
  "code": "BadRequest",
  "message": "The provided cross partition query can not be directly served by the gateway. This is a first chance (internal) exception that all newer clients will know how to handle gracefully. This exception is traced, but unless you see it bubble up as an exception (which only happens on older SDK clients), then you can safely ignore this message.\r\nActivityId: 5918ae0e-71ab-48a4-aa20-edd8427fe21f, Microsoft.Azure.Documents.Common/2.11.0",
  "additionalErrorInfo": "{\"partitionedQueryExecutionInfoVersion\":2,\"queryInfo\":{\"distinctType\":\"None\",\"top\":null,\"offset\":10,\"limit\":10,\"orderBy\":[],\"orderByExpressions\":[],\"groupByExpressions\":[],\"groupByAliases\":[],\"aggregates\":[],\"groupByAliasToAggregateType\":{},\"rewrittenQuery\":\"SELECT *\\nFROM Faults AS f\\nWHERE CONTAINS(f.Key, \\\"start\\\", true)\\nOFFSET 0 LIMIT 20\",\"hasSelectValue\":false},\"queryRanges\":[{\"min\":\"\",\"max\":\"FF\",\"isMinInclusive\":true,\"isMaxInclusive\":false}]}"
}
Thanks for your help
It seems that you can't execute cross-partition queries through the REST API.
You probably have to use the official SDKs.
Cosmos DB : cross partition query can not be directly served by the gateway
Thanks decoy for pointing me in the right direction.
OFFSET LIMIT is not supported by the REST API.
Pagination, though, can be achieved with headers, without using the SDK.
Set on your first request:
x-ms-max-item-count to the number of records you want to retrieve at a time, e.g. 10.
With the response you get the header:
x-ms-continuation, a string that points to the next document.
To get the next 10 documents, send a new request with the headers:
x-ms-max-item-count = 10, just like the first one.
x-ms-continuation set to the value you got from the response.
So it is very easy to get the next documents, but it is not straightforward to get the previous ones. I had to save the page numbers and x-ms-continuation strings as key-value pairs and use them to implement 'previous page' pagination.
I don't know if there is an easier way to do it.
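A minimal Java sketch of that continuation loop (the endpoint is a placeholder, and a real request also needs the signed authorization, x-ms-date, and x-ms-version headers, omitted here):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CosmosRestPagination {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // Placeholder document-query endpoint.
        String url = "https://<account>.documents.azure.com/dbs/<db>/colls/<coll>/docs";
        String query = "{\"query\": \"SELECT * FROM Faults f WHERE CONTAINS(f.Key, 'start', true)\"}";
        String continuation = null;

        do {
            HttpRequest.Builder rq = HttpRequest.newBuilder(URI.create(url))
                    .header("x-ms-max-item-count", "10")                      // page size
                    .header("x-ms-documentdb-isquery", "true")
                    .header("x-ms-documentdb-query-enablecrosspartition", "true")
                    .header("Content-Type", "application/query+json")
                    .POST(HttpRequest.BodyPublishers.ofString(query));
            if (continuation != null) {
                rq.header("x-ms-continuation", continuation);                 // resume at the next page
            }
            HttpResponse<String> rsp = http.send(rq.build(), HttpResponse.BodyHandlers.ofString());
            System.out.println(rsp.body());                                   // one page of documents
            // An absent header means there are no more pages.
            continuation = rsp.headers().firstValue("x-ms-continuation").orElse(null);
        } while (continuation != null);
    }
}

To page backwards, store each page number together with the continuation value that produced it, as described above.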

Most efficient way to query across partitions with Cosmos DB for bulk operations

I have a cross-partition query that returns the rows for each partition in turn, which makes sense: all of partition 1's results, then all of partition 2's, etc.
For each row returned I need to perform an action, which could be a delete or an update.
There are too many records to read them all in and then perform the actions, so I need to stream in the results and perform the actions at the same time.
The issue is that I run out of RUs very quickly, because my actions run against each partition in turn and a single partition has only a tenth of the allocated RUs.
I can specify a PartitionKey in the FeedOptions, but that does not help me, as I don't know what the key will be.
My query looks like
select r.* from r where r.deleted
the partition key is a field called container
Imagine I have the following items:

container | title   | deleted
jamjar    | jam     | true    <--- stored in partition 5
jar       | pickles | true    <--- stored in partition 5
tin       | cookies | true    <--- stored in partition 8
tub       | sweets  | true    <--- stored in partition 9
If I do select r.title from r where r.deleted,
my query will return the rows in the following order:
jam <--- stored in partition 5
pickles <--- stored in partition 5
cookies <--- stored in partition 8
sweets <--- stored in partition 9
I use an ActionBlock to spin up 2 threads that perform my action on each returned row, so I work on jam and pickles, then on cookies and sweets, consuming RUs only from partition 5 while I am carrying out the action on jam and pickles.
I would like the results to be returned as:
jam <--- stored in partition 5
cookies <--- stored in partition 8
sweets <--- stored in partition 9
pickles <--- stored in partition 5
For normal API calls we always know the container; this requirement is for a bulk and very infrequent delete.
If I knew the number of partitions and could supply the partition number to the query, that would be fine; I would be happy to issue 10 queries and just treat this as 10 separate jobs.
You need to set MaxDegreeOfParallelism, which is part of the FeedOptions:
FeedOptions queryOptions = new FeedOptions
{
    EnableCrossPartitionQuery = true,
    MaxDegreeOfParallelism = 10, // drain up to 10 partition key ranges in parallel
};
It will create a client thread for each partition; you can see what is happening if you inspect the HTTP headers:
x-ms-documentdb-query-enablecrosspartition: True
x-ms-documentdb-query-parallelizecrosspartitionquery: True
x-ms-documentdb-populatequerymetrics: False
x-ms-documentdb-partitionkeyrangeid: QQlvANNcKgA=,3
Notice the QQlvANNcKgA=,3: you see 10 of these, with ,0 through ,9. I suspect the first part is some page tracking and the second part is the partition.
See the docs Parallel query execution
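For anyone on the Java SDK, a rough equivalent of this setting in the Azure Cosmos DB Java SDK v4 might look like the sketch below (connection and container details are placeholders; cross-partition querying is enabled by default in v4):

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosQueryRequestOptions;

public class ParallelQuerySketch {
    public static void main(String[] args) {
        // Placeholder connection and container details.
        CosmosClient client = new CosmosClientBuilder()
                .endpoint("https://<account>.documents.azure.com:443/")
                .key("<key>")
                .buildClient();
        CosmosContainer container = client.getDatabase("mydb").getContainer("items");

        // Fan the query out across partition key ranges instead of draining
        // them one at a time.
        CosmosQueryRequestOptions options = new CosmosQueryRequestOptions()
                .setMaxDegreeOfParallelism(10);

        container.queryItems("SELECT * FROM r WHERE r.deleted", options, Object.class)
                .forEach(System.out::println);
        client.close();
    }
}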
Here's the timeline view of 3 queries in Fiddler:
MaxDegreeOfParallelism = 10: slower and not quite parallel while the threads and connections are spun up (you can see the 5 extra SSL handshakes in the listing on the left, and a gap before the last 5 requests of the 'green' set in the timeline). There are also 2 requests (for some reason) to get the PK ranges for the collection.
MaxDegreeOfParallelism = 10 (again) : almost optimally parallel. The PK range info seems to be cached from the previous request and reused here without making any extraneous requests.
MaxDegreeOfParallelism = 0: completely sequential.
Interestingly, these requests don't specify an x-ms-documentdb-partitionkeyrangeid header.
The query is run against a collection with 6 physical partitions, using DocumentClient v2.x.
Notice also that 7 requests are fired for every query, the 1st one is a 'query plan request' (not parallelizable) while the following 6 return the actual data.

Behavior of Oracle GoldenGate CSN when multiple Replicats exist?

I need more details about the Oracle GoldenGate CSN.
The system architecture is configured as follows:
Source database - Oracle
Target database - Oracle
For each table on the source database, 2 tables (a BASE table and a DELETE table) are defined on the target database.
2 Replicats are configured to transfer data from the source to the target database.
One Replicat moves the INSERTs/UPDATEs to the target database and the other Replicat moves the DELETE records to the target.
The following view, defined by GG, gives GoldenGate metadata information:
[screenshot: GoldenGate metadata info]
The row whose server name ends with 'CRN01A' represents the GG Replicat for the BASE table.
The row whose server name ends with 'CRN01D' represents the GG Replicat for the DELETE table.
APPLIED_LOW_POSITION means 'all messages with commit position less than this value have been applied'.
Our question is whether the two Replicats each have their own isolated CSN, or whether they are synchronized with respect to the Extract.
Example:
APPLIED_LOW_POSITION initial value: 100 for both the BASE table Replicat and the DELETE table Replicat.
100 INSERTs/UPDATEs occur on the source DB. The BASE table Replicat's APPLIED_LOW_POSITION changes to 200.
After step 2, 3 DELETEs occur on the source system. Our question is: at this point in time, what will the APPLIED_LOW_POSITION value be for the DELETE Replicat?
Will it be 103 or 203?
Can you please share your thoughts?
The CSN counter is tied to the CSN on the source database, and it always grows.
If the CSN was 100 and some transactions (INSERTs) occurred, it grew to 200.
Then the 3 additional DELETE operations were made, and it grew from 200 to 203.
Those 3 DELETE operations may not be replicated to the INSERT/UPDATE target, so APPLIED_LOW_POSITION may not change on that target and stays at the 200 level.
But on the DELETE target, APPLIED_LOW_POSITION would rise to 203.
