I have a requirement where I iterate through 10,000,000 documents, and for each document I do some operation and store some values in '/count.xml'. When I move on to the next document, I update '/count.xml' with the new values.
Currently this is what I am doing; here $total-records is 10,000,000:
let $total-records := xdmp:estimate(cts:search( //some code))
let $batch-size := 5000
let $pagination := 0
let $bs :=
  for $records in 1 to fn:ceiling($total-records div $batch-size)
  let $start := fn:sum($pagination + 1)
  let $end := fn:sum($batch-size + $pagination)
  let $_ := xdmp:set($pagination, $end)
  return
    xdmp:spawn-function(
      function() {
        for $each in cts:search( //some code)[$start to $end]
        return //some operation and update '/count.xml' with some updated values
      },
      <options xmlns="xdmp:eval"><commit>auto</commit><update>true</update></options>
    )
let $doc := doc("/count.xml")
return ()
let $doc := doc("/count.xml")
return ()
So the issue here is that I need to read the '/count.xml' file after all documents have been iterated, but with the above code using spawned tasks,
let $doc := doc("/count.xml")
will not be the latest version, as the spawned tasks above run on different threads.
I need a solution where
let $doc := doc("/count.xml")
waits till all spawned tasks are completed.
I have come across the
<result>{fn:true()}</result>
option as well, but I do not know whether it will work, because the variable
$bs
is not used anywhere, and the documentation says: 'When the calling request uses the value future in any operation, it will automatically wait for the spawned task to complete and it will use the result.'
Is there any other alternative where
let $doc := doc("/count.xml")
line is executed only after all spawned tasks are completed?
To process 10 million documents, you would need to spawn something like 10,000 batches of 1,000 docs each. I don't think that will work well from within MarkLogic.
I'd advise looking into the built-in aggregation features of MarkLogic. See for instance cts:sum-aggregate. You might be able to pre-calculate per-document intermediate results that you could aggregate at run time using those aggregation features. That would definitely be most performant, and would scale best.
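For illustration, a minimal sketch of that idea, assuming an element range index of type int exists on a hypothetical count element holding each document's pre-calculated intermediate result (the collection query is a stand-in for your real cts:query):

cts:sum-aggregate(
  cts:element-reference(xs:QName("count"), "type=int"),
  (),
  cts:collection-query("my-docs")  (: stand-in for the real query :)
)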
An alternative would be to orchestrate your calculations from outside of MarkLogic. Otherwise you end up either flooding the task queue, running into timeout limits, or both. Tools like Corb2 and DMSDK could be of help with this.
Note: you can indeed make spawns wait for result by using the <result> option, but either use <result>true</result> or <result>{fn:true()}</result> (note the parentheses behind fn:true, it is a function).
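For illustration, a rough sketch of that pattern applied to the code in the question (the per-batch work is left as a placeholder):

let $results :=
  for $batch in 1 to fn:ceiling($total-records div $batch-size)
  return
    xdmp:spawn-function(
      function() {
        (: process one batch and update '/count.xml' here :)
        ()
      },
      <options xmlns="xdmp:eval">
        <update>true</update>
        <result>true</result>
      </options>
    )
(: using the value futures in any operation, e.g. counting them,
   makes this request wait for all spawned tasks to finish :)
let $_ := fn:count($results)
return doc("/count.xml")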
HTH!
With the requirement as given, one cannot tell the difference between writing the final result once after a query across 10 million docs vs writing the result after querying 1 document at a time. Since your example does no writes to the queried documents, it need not be spawned nor run in a separate thread or transaction; rather, as the previous answer says, you can make use of aggregate functions to do a single query over the entire set, compute the final result and store it in 1 operation. Likely this will run very quickly (or can be made to).
If the requirement is actually that each single document MUST be queried and then, sequentially, another shared document written to, this can only be observed by using separate transactions, serially. It is going to be horrendously slow, almost certainly longer than the timeout for the calling transaction. This means you must orchestrate it from outside, unless the requirement is that the same caller that starts the process also finishes it (a highly implementation-specific requirement that, if true, is likely to have other implications beyond those given).
Something close that's achievable, but still horrendously slow, is to have an outside query poll on the updated shared document and return 'success' once the job is done.
But again, with this many documents, if you're forcing a write transaction for each one, it's going to take longer (or at least is not easily guaranteed NOT to take longer) than a single transaction timeout, so it must be invoked from 'outside'.
This is where I would recommend revisiting the requirements to determine the core functionality/result that is desired and if it is truly required to implement exactly as stated vs a more performant implementation that achieves the desired result.
If the core functionality needed is that every single query be 'checkpointed' with a document update, then there are other implications such as transaction rollback that need to be considered.
My painful hunt for this feature is fully described in a disgustingly long question: Several last offsets aren't getting commited with reactive kafka, which shows my multiple attempts with different failures.
How would one subscribe to ReactiveKafkaConsumerTemplate<String, String>, which will process the records in a synchronous way (for simplicity), and will ack/commit every 2s AND upon manual cancellation of the stream? I.e. it works, acking/committing every 2s; then a signal comes via REST/JMX/whatever, the stream terminates and acks/commits the last processed Kafka record.
After a lot of attempts I was able to come up with the following solution. It seems to work, but it's kinda ugly, because it's very "white-box": the outer flow highly depends on stuff happening inside other methods. Please criticise and suggest improvements. Thanks.
kafkaReceiver.receive()
    .flatMapSequential(receivedKafkaRecord -> processKafkaRecord(receivedKafkaRecord), 16)
    .takeWhile(e -> !stopped)
    .sample(configuration.getKafkaConfiguration().getCommitInterval())
    .concatMap(offset -> {
        log.debug("ack/commit offset {}", offset.offset());
        offset.acknowledge();
        return offset.commit();
    })
    .doOnTerminate(() -> log.info("stopped."));
What didn't work:
A) you cannot use Disposable.dispose, since that would break the stream and your latest processed record won't be committed.
B) you cannot put take on top of stream, as that would cancel the stream and you won't be able to commit either.
C) I am not sure how I would be able to incorporate error handling here.
Because of what didn't work, stream termination is triggered by a boolean field named stopped, which can be set from anywhere.
Flow explained:
flatMapSequential is used because of the inner parallelism and the necessity to commit offset N only if all offsets up to N-1 were processed.
processKafkaRecord returns Mono<ReceiverOffset>, i.e. the offset of the processed record, so there is something to ack/commit. When stopped, the method skips processing and returns Mono.empty.
takeWhile will stop the stream once stopped; it has to be placed here because a whole sample interval may consist only of "empties".
The rest is simple: sample at the given interval and commit in order. If sample returns no record, the commit is skipped. Finally we log that the stream was cancelled.
If anyone knows how to improve this, please criticise.
In general, if I want to be sure what happens when several threads make concurrent updates to the same item in DynamoDB, I should use conditional updates (i.e., "optimistic locking"). I know that. But I was wondering if there is any other case when I can be sure that concurrent updates to the same item survive.
For example, in Cassandra, making concurrent updates to different attributes of the same item is fine, and both updates will eventually be available to read. Is the same true in DynamoDB? Or is it possible that only one of these updates survive?
A very similar question is what happens if I add, concurrently, two different values to a set or list in the same item. Am I guaranteed that I'll eventually see both values when I read this set or list, or is it possible that one of the additions will mask out the other during some sort of DynamoDB "conflict resolution" protocol?
I see a version of my second question was already asked here in the past: Are DynamoDB "set" values CDRTs?, but the answer referred to a not-very-clear FAQ entry which doesn't exist any more. What I would most like to see as an answer to my question is official DynamoDB documentation that says how DynamoDB handles concurrent updates when neither "conditional updates" nor "transactions" are involved, and in particular what happens in the above two examples. Absent such official documentation, does anyone have any real-world experience with such concurrent updates?
I just had the same question and came across this thread. Given that there was no answer I decided to test it myself.
The answer, as far as I can observe, is that as long as you are updating different attributes it will eventually succeed. It does take a little longer the more updates I push to the item, so they appear to be written in sequence rather than in parallel.
I also tried updating a single List attribute in parallel and this expectedly failed: the resulting list, once all queries had completed, was broken and only contained some of the entries pushed to it.
The test I ran was pretty rudimentary and I might be missing something but I believe the conclusion to be correct.
For completeness, here is the script I used, nodejs.
const aws = require('aws-sdk');
const ddb = new aws.DynamoDB.DocumentClient();
const key = process.argv[2];
const num = process.argv[3];
run().then(() => {
  console.log('Done');
});

async function run() {
  const p = [];
  for (let i = 0; i < num; i++) {
    p.push(ddb.update({
      TableName: 'concurrency-test',
      Key: {x: key},
      UpdateExpression: 'SET #k = :v',
      ExpressionAttributeValues: {
        ':v': `test-${i}`
      },
      ExpressionAttributeNames: {
        '#k': `k${i}`
      }
    }).promise());
  }
  await Promise.all(p);
  const response = await ddb.get({TableName: 'concurrency-test', Key: {x: key}}).promise();
  const item = response.Item;
  console.log('keys', Object.keys(item).length);
}
Run like so:
node index.js {key} {number}
node index.js myKey 10
Timings:
10 updates: ~1.5s
100 updates: ~2s
1000 updates: ~10-20s (fluctuated a lot)
Worth noting is that the metrics show a lot of throttled events, but these are handled internally by the Node.js SDK using exponential backoff, so once the dust settled everything was written as expected.
Your post contains quite a lot of questions.
There's a note in DynamoDB's manual:
All write requests are applied in the order in which they were received.
I assume that the clients send the requests in the order they were passed through a call.
That should resolve the question of whether there are any guarantees. If you update different properties of an item in several requests, each updating only those properties, it should end up in the expected state (the 'sum' of the distinct changes).
If you, on the other hand, update the whole object, the last one will win.
DynamoDB has the @DynamoDbVersion annotation, which you can use for optimistic locking to manage concurrent writes of whole objects.
For scenarios like auctions or parallel tick counts (such as "likes"), DynamoDB offers atomic counters.
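For example, a minimal sketch of an atomic counter increment with the DocumentClient; the table, key and attribute names are made up, and concurrent callers each add to the stored value server-side instead of overwriting it:

// Minimal sketch: atomic counter increment; table, key and attribute names are made up.
const aws = require('aws-sdk');
const ddb = new aws.DynamoDB.DocumentClient();

ddb.update({
  TableName: 'concurrency-test',
  Key: {x: 'myKey'},
  UpdateExpression: 'ADD #counter :inc',           // ADD applies the increment server-side
  ExpressionAttributeNames: {'#counter': 'likes'},
  ExpressionAttributeValues: {':inc': 1}
}).promise().then(() => console.log('incremented'));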
If you update a list, the result depends on whether you use DynamoDB's list type (L), or whether it is just a property that the client serializes into a String (S). If you read a property, change it, and write it back, and do that in parallel, the result will be subject to eventual consistency: what you read may not be the latest write. Applied to lists, and several times, you'll end up with some of the elements added and some not (or, better said, added but then overwritten).
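Relatedly, for the list case, one way to avoid the read-modify-write race described above is to let DynamoDB do the append server-side with list_append. A rough sketch, again with made-up names:

// Rough sketch: server-side append to a list attribute; table, key and attribute names are made up.
const aws = require('aws-sdk');
const ddb = new aws.DynamoDB.DocumentClient();

ddb.update({
  TableName: 'concurrency-test',
  Key: {x: 'myKey'},
  UpdateExpression: 'SET #items = list_append(if_not_exists(#items, :empty), :vals)',
  ExpressionAttributeNames: {'#items': 'items'},
  ExpressionAttributeValues: {':vals': ['new-entry'], ':empty': []}
}).promise().then(() => console.log('appended'));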
This question is in reference to the Data Hub Framework.
I have 3-4 conditions in which I am doing operations like xdmp:node-replace and xdmp:document-delete, and after all the conditions I am trying to insert the document using xdmp:document-insert.
When I run the conditions independently by commenting out the others, it works fine, but if I try to run 2 or more conditions together, I get XDMP-CONFLICTINGUPDATES.
$envelope comes from the STAGING database, which I am using in writer.xqy.
The code sample is below:
let $con1 := if ($envelope/*:test/text() eq "abc")
             then xdmp:node-replace(....) else ()
let $con2 := if ($envelope/*:test/text() eq "123")
             then xdmp:node-replace(....) else ()
let $con3 := if ($envelope/*:test/text() eq "cde")
             then xdmp:document-delete(....) else ()
return
  if ($envelope//*:FLAG/text() eq "1")
  then xdmp:document-insert($id, $envelope, xdmp:default-permissions(), map:get($options, "entity"))
  else ()
Any suggestions?
XDMP-CONFLICTINGUPDATES means you are trying to update the same node more than once within a single transaction. Solving these types of errors can be infamously tricky and is a rite of passage for every MarkLogician.
In your case, this is caused by updating a node with xdmp:node-replace and then updating the document node which is the parent of that node with xdmp:document-insert. Thus, because you are updating both the node and its parent, you are in effect updating that node twice causing the error. Or, this may also occur from trying to both delete and insert a document at the same URI within the same transaction.
Here is a simple query you can run in QConsole to reproduce this behavior:
xquery version "1.0-ml";
xdmp:document-insert("/test.xml", <test><value></value></test>);
xquery version "1.0-ml";
let $d := fn:doc("/test.xml")
let $_ := xdmp:node-replace($d//value, <value>test</value>)
return
xdmp:document-insert("/test.xml", $d)
In the case of this demonstration, as well as your code, the xdmp:document-insert is redundant and can simply be removed.
Likely the XQuery statement above is attempting multiple updates to the same node in the same single-statement transaction: each xdmp:node-replace call performs an update to the same node. See the documentation for more details.
Here are two solutions that may work for you:
Use conditional statements to decide what kind of update needs to be performed on the node, e.g., whether the node needs to be deleted, or whether it needs to be updated and how. At the end of your script you can then apply the chosen update to the node (see the sketch after these two options).
Perform in-memory updates to the node, then commit the node to the database at the end of the transaction. Here is one library you could use: https://github.com/ryanjdew/XQuery-XML-Memory-Operations
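A minimal sketch of the first option, with a made-up URI and element names: decide the final state first, then make exactly one update call per URI.

xquery version "1.0-ml";
(: Minimal sketch: decide what the final state should be, then perform
   exactly one update operation per URI. The URI and element names are
   made up, not the asker's real data model. :)
let $uri := "/test.xml"
let $doc := fn:doc($uri)
return
  if ($doc/test/value = "delete-me")
  then xdmp:document-delete($uri)
  else
    let $new :=
      <test>{
        for $v in $doc/test/value
        return
          if ($v eq "abc")
          then <value>replaced</value>
          else $v
      }</test>
    return xdmp:document-insert($uri, $new)  (: one write, no conflicting updates :)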
One general possibility for complicated updates: use XSLT.
This is a case of multiple update statements in a single transaction. There are multiple ways to handle it in your scenario:
Use xdmp:eval to run one of the updates in a separate transaction (rough sketch after this list)
Use a memory-operations library (like the one linked above) to replace your nodes in memory before writing once
Rewrite your query to avoid the transaction conflict
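A rough sketch of the xdmp:eval route, with a made-up URI and element; note that with different-transaction isolation the evaluated code cannot see uncommitted updates from the calling request:

xquery version "1.0-ml";
(: Rough sketch: run a conflicting update in its own transaction. :)
xdmp:eval(
  'declare variable $uri external;
   xdmp:node-replace(fn:doc($uri)/test/value, <value>updated</value>)',
  (xs:QName("uri"), "/test.xml"),
  <options xmlns="xdmp:eval">
    <isolation>different-transaction</isolation>
  </options>
)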
I'm trying to use Dart with sqlite, with this project dart-sqlite.
But I found a problem: the API it provides is synchronous in style. The code looks like:
// Iterating over a result set
var count = c.execute("SELECT * FROM posts LIMIT 10", callback: (row) {
print("${row.title}: ${row.body}");
});
print("Showing ${count} posts.");
With such code, I can't use Dart's Future support, and the code will block on SQL operations.
I wonder how to change the code to asynchronous style? You can see it defines some native functions here: https://github.com/sam-mccall/dart-sqlite/blob/master/lib/sqlite.dart#L238
_prepare(db, query, statementObject) native 'PrepareStatement';
_reset(statement) native 'Reset';
_bind(statement, params) native 'Bind';
_column_info(statement) native 'ColumnInfo';
_step(statement) native 'Step';
_closeStatement(statement) native 'CloseStatement';
_new(path) native 'New';
_close(handle) native 'Close';
_version() native 'Version';
The native functions are mapped to some c++ functions here: https://github.com/sam-mccall/dart-sqlite/blob/master/src/dart_sqlite.cc
Is it possible to change to asynchronous? If possible, what shall I do?
If it is not possible, so that I have to rewrite it, do I have to rewrite all of:
The dart file
The c++ wrapper file
The actual sqlite driver
UPDATE:
Thanks to @GregLowe's comment: Dart's Completer can convert callback style to Future style, which lets me use Dart's doSomething().then(...) instead of passing a callback function.
But after reading the source of dart-sqlite, I realized that, in the implementation of dart-sqlite, the callback is not event-based:
int execute([params = const [], bool callback(Row)]) {
  _checkOpen();
  _reset(_statement);
  if (params.length > 0) _bind(_statement, params);
  var result;
  int count = 0;
  var info = null;
  while ((result = _step(_statement)) is! int) {
    count++;
    if (info == null) info = new _ResultInfo(_column_info(_statement));
    if (callback != null && callback(new Row._internal(count - 1, info, result)) == true) {
      result = count;
      break;
    }
  }
  // If update affected no rows, count == result == 0
  return (count == 0) ? result : count;
}
Even if I use a Completer, it won't improve the performance. I think I may have to rewrite the C++ code to make it event-based first.
You should be able to write a wrapper without touching the C++. Have a look at how to use the Completer class in dart:async. Basically you need to create a Completer, return Completer.future immediately, and then call Completer.complete(row) from the existing callback.
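A rough sketch of such a wrapper, following the calling convention shown in the question; the function name is made up, and (as noted in the update above) it only changes the shape of the API, since execute() itself still blocks:

import 'dart:async';
// (import of dart-sqlite omitted; `db` is whatever object exposes the
//  execute() call shown in the question)

// Rough sketch of a Completer-based wrapper; queryAsync is a made-up name.
Future<List> queryAsync(db, String sql) {
  final completer = new Completer<List>();
  final rows = [];
  try {
    db.execute(sql, callback: (row) {
      rows.add(row); // collect each row; not returning true keeps iterating
    });
    completer.complete(rows);
  } catch (e) {
    completer.completeError(e);
  }
  return completer.future;
}

// Usage: queryAsync(c, "SELECT * FROM posts LIMIT 10").then((rows) => print(rows.length));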
Re: update. Have you seen this article, specifically the bit about asynchronous extensions? i.e. If the C++ API is synchronous you can run it in a separate thread, and use messaging to communicate with it. This could be a way to do it.
The big problem you've got is that SQLite is an embedded database; in order to process your query and provide your results, it must do computation (and I/O) in your process. What's more, in order for its transaction handling system to work, it either needs its connection to be in the thread that created it, or for you to run in serialized mode (with a performance hit).
Because these are fairly hard constraints, your plan of switching things to an asynchronous operation mode is unlikely to go well except by using multiple threads. Since using multiple connections complicates things a lot (as you can't share some things between them, such as TEMP TABLEs) let's consider going for a single serialized connection; all activity will be serialized at the DB level, but for an application that doesn't use the DB a lot it will be OK. At the C++ level, you'd be talking about calling that execute from another thread and then sending messages back to the caller thread to indicate each row and the completion.
But you'll take a real hit when you do this; in particular, you're committing to only doing one query at a time, as the technique runs into significant problems with semantic effects when you start using two connections at once and the DB forces serialization on you with one connection.
It might be simpler to do the above by putting the synchronous-asynchronous coupling at the Dart level by managing the worker thread and inter-thread communication there. That would let you avoid having to change the C++ code significantly. I don't know Dart well enough to be able to give much advice there.
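For what it's worth, a rough Dart-side sketch of that idea using dart:isolate for the worker and message passing; the helper names are made up, and whether the native sqlite handles can actually be used from a second isolate is an open question:

import 'dart:async';
import 'dart:isolate';

// Rough sketch: run the blocking query in a separate isolate and expose the
// result as a Future. queryBlocking stands in for the real synchronous call.
Future<String> queryInWorker(String sql) async {
  final receivePort = new ReceivePort();
  await Isolate.spawn(_worker, [receivePort.sendPort, sql]);
  return await receivePort.first;
}

void _worker(List args) {
  final SendPort replyTo = args[0];
  final String sql = args[1];
  final result = queryBlocking(sql); // the blocking call runs off the main isolate
  replyTo.send(result);
}

// Stand-in for the real synchronous call into dart-sqlite.
String queryBlocking(String sql) => 'rows for: $sql';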
Myself, I'd just stick with synchronous connection processing so that I can make my application use multi-threaded mode more usefully. I'd be taking the hit with the semantics and giving each thread its own connection (possibly allocated lazily) so that overall speed was better, but I do come from a programming community that regards threads as relatively heavyweight resources, so make of that what you will. (Heavy threads can do things that reduce the number of locks they need that it makes no sense to try to do with light threads; it's about overhead management.)
How do I create a PL/SQL function which waits for an update on some row for a specified timeout and then returns?
What I want to accomplish is this: I have a long-running process which updates its status in the ASYNC_PROCESS table, keyed by process_id. I need a function which returns true/false when this process has completed, but I also need this function to wait some time for the process to complete, returning on timeout or returning immediately with true when the process has completed. I don't want to use sleep(1 sec), because then I will have a 1 sec lag. I don't want to use sleep(1 msec), because then I am wasting CPU resources (and still have a 1 msec lag).
How would an experienced programmer accomplish this?
The function will be called from .NET, so I need minimal lag between the DB operation and .NET/UI.
THNX,
Beef
I think the most sensible thing to do in this case is to use update triggers on that ASYNC_PROCESS table.
You should also look into the DBMS_ALERT package. Here's an edited excerpt from that doc:
Create an alert:
DBMS_ALERT.REGISTER('emp_table_alert');
Create a trigger on your table to fire the alert:
CREATE TRIGGER emptrig AFTER INSERT ON emp
BEGIN
DBMS_ALERT.SIGNAL('emp_table_alert', 'message_text');
END;
From your .NET code, you can then use something that calls this:
DBMS_ALERT.WAITONE('emp_table_alert', :message, :status, :timeout);
Make sure you read the docs for what :status and :timeout do.
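A minimal sketch of the waiting side as an anonymous block; the 30-second timeout is an arbitrary choice:

-- Minimal sketch of the waiting side; the 30-second timeout is arbitrary.
DECLARE
  l_message VARCHAR2(1800);
  l_status  INTEGER;  -- 0 = alert received, 1 = timed out
BEGIN
  DBMS_ALERT.REGISTER('emp_table_alert');
  DBMS_ALERT.WAITONE('emp_table_alert', l_message, l_status, 30);
  IF l_status = 0 THEN
    DBMS_OUTPUT.PUT_LINE('completed: ' || l_message);
  ELSE
    DBMS_OUTPUT.PUT_LINE('timed out');
  END IF;
END;
/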
You should look at Oracle Advanced Queuing. It offers the kind of functionality you're looking for.
You'll probably need a separate queue table where a trigger on ASYNC_PROCESS inserts messages. You then use the AQ functions to retrieve (or wait for) the next message in the queue table.
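For instance, the consuming side could block on a dequeue with a wait time. This assumes a queue named status_queue with a RAW payload has already been set up with DBMS_AQADM; a timeout raises an exception (ORA-25228) that the caller can handle:

-- Rough sketch of the consuming side; assumes status_queue already exists.
DECLARE
  l_dequeue_options    DBMS_AQ.DEQUEUE_OPTIONS_T;
  l_message_properties DBMS_AQ.MESSAGE_PROPERTIES_T;
  l_msgid              RAW(16);
  l_payload            RAW(2000);
BEGIN
  l_dequeue_options.wait := 30;  -- block for up to 30 seconds
  DBMS_AQ.DEQUEUE(
    queue_name         => 'status_queue',
    dequeue_options    => l_dequeue_options,
    message_properties => l_message_properties,
    payload            => l_payload,
    msgid              => l_msgid);
  COMMIT;
END;
/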
If you're doing this in C#.NET, why wouldn't you simply spawn a worker thread to do the update (via ODAC)? Why hand the responsibility over to Oracle when (it seems) you want a .NET process to make the update call (in the background) and have the main process be notified of its completion?
See here and here for examples, although there are several approaches in .NET for this (delegates, events, async callbacks, thread pools, etc)