MarkLogic commit frame/return sequence guarantee - XQuery

I have a simple one-node MarkLogic server from which I need to purge documents daily.
The test query below selects the documents and then returns a sequence that should do the following for each document:
1. Output the name of the file being extracted.
2. Ensure the directory path of the file in #1 exists.
3. Save a zipped version of the document to the file in #1.
4. Delete the document.
Is this structure safe? It returns a sequence for each document to be deleted, and the last item in the returned sequence deletes the document. If any of the prior steps fail, will the document still be deleted? Should I trust the engine to execute the returned sequence in the order given?
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
let $dateLimitAll := current-dateTime() - xs:dayTimeDuration("P1460D")
let $dateLimitSome := current-dateTime() - xs:dayTimeDuration("P730D")
for $adoc in doc()[1 to 5]
let $docDate := $adoc/Unit/created
let $uri := document-uri($adoc)
let $path := fn:concat("d:/purge/", $adoc/Unit/xmldatastore/state/data(), "/", fn:year-from-dateTime($docDate), "/", fn:month-from-dateTime($docDate))
let $filename := fn:concat($path, "/", $uri, ".zip")
where $docDate < $dateLimitAll
   or ($docDate < $dateLimitSome and $adoc/Unit/xmldatastore/state != "FIRMED" and $adoc/Unit/xmldatastore/state != "SHIPPED")
return (
  $filename,
  xdmp:filesystem-directory-create($path, map:new(map:entry("createParents", fn:true()))),
  xdmp:save($filename, xdmp:zip-create(<parts xmlns="xdmp:zip"><part>{$uri}</part></parts>, doc($uri))),
  xdmp:document-delete($uri)
)
P.S. Please ignore the [1 to 5] doc limit; it was added for testing.

If any of the prior steps fail, will the document still be deleted?
If there is an error in the execution of that module, the transaction will roll back and the delete from the database will be undone.
However, the directory and zip file written to the filesystem will persist and will not be deleted. The xdmp:filesystem-directory-create() and xdmp:save() calls are not rolled back or undone when the transaction rolls back.
Should I trust the engine to execute the return sequence in order given?
Not sure that it matters much, given the statement above.
Is this structure safe?
It is unclear how many documents you might be dealing with. You may find that the filtering is better/faster using cts:search and some indexes to target the candidate documents. Also, even if you can select the set of documents to process faster, if there are a lot of documents you could still exceed execution time limits.
Another approach is to break up the work. Select the URIs of the documents that match the criteria, and then have a separate query execution for each document that is responsible for saving the zip file and deleting the document from the database. This is likely to be faster, since you can process multiple documents in parallel; it avoids the risk of a timeout; and, in the event of an exception, it allows some items to fail without causing the entire set to fail and roll back.
Tools such as CoRB were built exactly for this type of batch work.
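For illustration, here is a hedged sketch of that URI-selection step. It assumes the URI lexicon is enabled and that element range indexes exist on the created (dateTime) and state (string) elements; the element names come from the question, and the index setup is an assumption, not something verified against your database:

xquery version "1.0-ml";

(: Sketch only: requires the URI lexicon plus range indexes on "created" and "state". :)
let $dateLimitAll  := fn:current-dateTime() - xs:dayTimeDuration("P1460D")
let $dateLimitSome := fn:current-dateTime() - xs:dayTimeDuration("P730D")
return
  cts:uris((), (),
    cts:or-query((
      cts:element-range-query(xs:QName("created"), "<", $dateLimitAll),
      cts:and-query((
        cts:element-range-query(xs:QName("created"), "<", $dateLimitSome),
        cts:not-query(cts:element-value-query(xs:QName("state"), ("FIRMED", "SHIPPED")))
      ))
    ))
  )

Each URI returned could then be handed to its own task (for example via xdmp:spawn, or a CoRB process module) that saves the zip file and deletes that single document, so a failure only affects one document.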

Related

Ignore empty datasets

I am writing a U-SQL script that sometimes ends up with an empty data set.
Today the outputter writes an empty file when that happens. I would like the outputter to not write anything in that case, since otherwise I will flood ADLS with empty files.
I have tried two things so far:
An IF statement - the problem here is that I do a SELECT COUNT(*) on the data set, and I cannot do IF #COUNT > 0 since #COUNT is a data set and the IF statement expects a scalar variable.
A custom outputter - but I have noticed that it is not the outputter that writes the file but some other code that runs afterwards; the file gets created after the custom outputter is done.
Does anyone have any guidance?
Thanks in advance!
One method is to cook your data into a table first. Then you can INSERT into the table instead of writing to a file. Empty INSERTs do not cause job failures, nor will they affect performance at runtime or future performance on the table. Let me know if you have other questions!
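For example, a minimal, untested sketch (the paths, schema, and table name MyDb.dbo.Results are made up, not from the question, and the table is assumed to already exist):

// Hypothetical extract; adjust the schema and path to your data.
@input =
    EXTRACT ColA int,
            ColB string
    FROM "/data/input.csv"
    USING Extractors.Csv();

@result =
    SELECT ColA, ColB
    FROM @input
    WHERE ColA > 0;

// If @result is empty, this simply inserts zero rows; no empty file lands in ADLS,
// unlike OUTPUT @result TO "/output/result.csv" USING Outputters.Csv();
INSERT INTO MyDb.dbo.Results
SELECT ColA, ColB FROM @result;

The managed table acts as a buffer: because inserting an empty rowset is a no-op, ADLS never sees an empty file.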

Mariadb SELECT not failing on lock

I’m trying to cause a ‘SELECT’ query to fail if the record it is trying to read is locked.
To simulate it I have added a trigger on UPDATE that sleeps for 20 seconds and then in one thread (Java application) I’m updating a record (oid=53) and in another thread I’m performing the following query:
“SET STATEMENT max_statement_time=1 FOR SELECT * FROM Jobs j WHERE j.oid =53”.
(Note: Since my mariadb server version is 10.2 I cannot use the “SELECT … NOWAIT” option and must use “SET STATEMENT max_statement_time=1 FOR ….” instead).
I would expect the SELECT to fail since the record is in the middle of an UPDATE and should be read/write locked, but the SELECT succeeds.
Only if I add FOR UPDATE to the SELECT query does the query fail (but this is not a good option for me).
I checked the INNODB_LOCKS table during this time and it was empty.
In the INNODB_TRX table I saw the transaction with isolation level REPEATABLE READ, but I don't know if it is relevant here.
Any thoughts, how can I make the SELECT fail without making it 'for update'?
Normally consistent (and dirty) reads are non-locking; they just read some sort of snapshot, depending on what your transaction isolation level is. If you want to make the read wait for a concurrent transaction to finish, you need to set the isolation level to SERIALIZABLE and turn off autocommit in the connection that performs the read. Something like
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET autocommit = 0;
SET STATEMENT max_statement_time=1 FOR ...
should do it.
Relevant page in MariaDB KB
Side note: my personal preference would be to use innodb_lock_wait_timeout=1 instead of max_statement_time=1. Both will make the statement fail, but innodb_lock_wait_timeout will cause an error code more suitable for the situation.
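Putting it together, a minimal sketch of the reading session (the table and column names are taken from the question):

-- Session performing the read; another session holds the row lock via an uncommitted UPDATE.
SET SESSION innodb_lock_wait_timeout = 1;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET autocommit = 0;

-- Under SERIALIZABLE a plain SELECT acquires shared locks, so instead of reading
-- a snapshot it waits for the concurrent UPDATE and fails after about one second
-- with ER_LOCK_WAIT_TIMEOUT (1205).
SELECT * FROM Jobs j WHERE j.oid = 53;

COMMIT;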

InterSystems Caché command to get the last updated timestamp of a table

I want to know the last update time of an InterSystems Caché DB table. Please let me know the relevant command. I ran through their command documentation:
http://docs.intersystems.com/latest/csp/docboo/DocBook.UI.Page.cls?KEY=GTSQ_commands
But I don't see any such command there. I also tried searching through this:
http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_currenttimestamp
Is this not the complete documentation of commands?
Caché does not maintain "last updated" information by default, as it might introduce an unnecessary performance penalty on DML operations.
You can add this field manually to every table of interest:
Property LastUpdated As %TimeStamp [ SqlComputeCode = { Set {LastUpdated}= $ZDT($H, 3) }, SqlComputed, SqlComputeOnChange = (%%INSERT, %%UPDATE) ];
This way it would keep the time of the last UPDATE/INSERT for every row, but it still would not help you with DELETE.
Alternatively, you can set up triggers for every DML operation to maintain a timestamp in a separate table.
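For example, a trigger in the table's class definition could look something like this (an untested sketch; the audit table MyApp.DmlAudit and its columns are made up):

Trigger LogChange [ Event = INSERT/UPDATE/DELETE, Time = AFTER ]
{
    // Untested sketch: append a row to a separate (made-up) audit table on every DML operation.
    &sql(INSERT INTO MyApp.DmlAudit (TableName, ChangedAt)
         VALUES ('MyApp.MyTable', CURRENT_TIMESTAMP))
}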
Without additional coding, the only way to gather this information is to scan journal files, which is not really the intended use for them and would be slow at best.

How to show inserted XML documents in order in MarkLogic?

I am inserting some XML documents from the UI into MarkLogic Server and at the same time showing them in a list. I want to show the documents in order: the document inserted first should come first in the list, the second document should come second, and so on. But MarkLogic is showing them in no particular order.
The insert order is not persisted or preserved when working with MarkLogic Server. If you want your documents' insert order to be preserved, the data or the data's properties will need some value on which the server can apply an "order by" clause.
for $doc in fn:doc()
order by $doc//some-aspect-of-the-xml-structure
return
$doc
The documents are indeed independent from each other in a "shared-nothing" architecture. This helps MarkLogic run much faster than some relational database approaches, where "rows" share membership and ordering in a "table" and as a result have trouble clustering efficiently.
You can order documents by the date of their last update:
(: If the URI lexicon is enabled; otherwise you can iterate over fn:collection() :)
for $uri in cts:uris((), "document")
let $updated-date := xdmp:document-get-properties($uri, fn:QName("http://marklogic.com/cpf", "last-updated"))
order by $updated-date/text()
return $uri
There is another way, without using the URI lexicon:
for $doc in fn:collection()
let $uri := xdmp:node-uri($doc)
let $updated-date := xdmp:document-get-properties($uri, fn:QName("http://marklogic.com/cpf", "last-updated"))
order by $updated-date/text()
return $uri

How to find out which package/procedure is updating a table?

I would like to find out if it is possible to determine which package, or which procedure in a package, is updating a table.
Due to a certain project being handed over without proper documentation (the person who handed over the project has since left), data that we know we have updated always goes back to some strange source point.
We are guessing that this could be a database job or scheduler that is running the update command without our knowledge. I am hoping that there is a way to find out where the update is coming from, perhaps by putting a trigger on the table that we are monitoring to capture the source.
Any ideas?
Thanks.
UPDATE: I poked around and found out how to trace a statement back to its owning PL/SQL object.
In combination with what Tony mentioned, you can create a logging table and a trigger that looks like this:
CREATE TABLE statement_tracker
( SID NUMBER
, serial# NUMBER
, date_run DATE
, program VARCHAR2(48) null
, module VARCHAR2(48) null
, machine VARCHAR2(64) null
, osuser VARCHAR2(30) null
, sql_text CLOB null
, program_id number
);
CREATE OR REPLACE TRIGGER smb_t_t
AFTER UPDATE
ON smb_test
BEGIN
INSERT
INTO statement_tracker
SELECT ss.SID
, ss.serial#
, sysdate
, ss.program
, ss.module
, ss.machine
, ss.osuser
, sq.sql_fulltext
, sq.program_id
FROM v$session ss
, v$sql sq
WHERE ss.sql_address = sq.address
AND ss.SID = USERENV('sid');
END;
/
In order for the trigger above to compile, you'll need to grant the owner of the trigger these permissions, when logged in as the SYS user:
grant select on V_$SESSION to <user>;
grant select on V_$SQL to <user>;
You will likely want to protect the insert statement in the trigger with some condition that only makes it log when the change you're interested in is occurring - on my test server this statement runs rather slowly (1 second), so I wouldn't want to be logging all these updates. Of course, in that case, you'd need to change the trigger to be a row-level one so that you could inspect the :new or :old values. If you are really concerned about the overhead of the select, you can change it to not join against v$sql, and instead just save the SQL_ADDRESS column, then schedule a job with DBMS_JOB to go off and update the sql_text column with a second update statement, thereby offloading the update into another session and not blocking your original update.
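As a hedged illustration of that, here is a row-level variant that only logs when a particular column changes; the column name status is invented for the example and the rest follows the trigger above:

CREATE OR REPLACE TRIGGER smb_t_t
AFTER UPDATE OF status ON smb_test        -- "status" is a made-up column name
FOR EACH ROW
WHEN (OLD.status <> NEW.status)           -- no colons on OLD/NEW inside the WHEN clause
BEGIN
  INSERT INTO statement_tracker
  SELECT ss.SID, ss.serial#, SYSDATE, ss.program, ss.module,
         ss.machine, ss.osuser, sq.sql_fulltext, sq.program_id
    FROM v$session ss
    JOIN v$sql sq ON sq.address = ss.sql_address
   WHERE ss.SID = USERENV('sid');
END;
/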
Unfortunately, this will only tell you half the story. The statement you're going to see logged is going to be the most proximal statement - in this case, an update - even if the original statement executed by the process that initiated it is a stored procedure. This is where the program_id column comes in. If the update statement is part of a procedure or trigger, program_id will point to the object_id of the code in question - you can resolve it thusly:
SELECT * FROM all_objects where object_id = <program_id>;
In the case when the update statement was executed directly from the client, I don't know what program_id represents, but you wouldn't need it - you'd have the name of the executable in the "program" column of statement_tracker. If the update was executed from an anonymous PL/SQL block, I'm not sure how to track it back - you'll need to experiment further.
It may be, though, that the osuser/machine/program/module information may be enough to get you pointed in the right direction.
If it is a scheduled database job then you can find out what scheduled database jobs exist and look into what they do. Other things you can do are:
Look at the dependency views, e.g. ALL_DEPENDENCIES, to see what packages/triggers etc. use that table (see the query sketch after this list). Depending on the size of your system, that may return a lot of objects to trawl through.
Search all the database source code for references to the table like this:
select distinct type, name
from all_source
where lower(text) like lower('%mytable%');
Again that may return a lot of objects, and of course there will be some "false positives" where the search string appears but isn't actually a reference to that table. You could even try something more specific like:
select distinct type, name
from all_source
where lower(text) like lower('%insert into mytable%');
but of course that would miss cases where the command was formatted differently.
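For the dependency-view option mentioned in the first bullet, a query along these lines lists the PL/SQL objects that reference the table (MYTABLE is a placeholder):

SELECT owner, name, type
  FROM all_dependencies
 WHERE referenced_type = 'TABLE'
   AND referenced_name = 'MYTABLE'   -- placeholder: your table name, in upper case
 ORDER BY owner, name, type;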
Additionally, could there be SQL scripts being run through "cron" jobs on the server?
Just write an "after update" trigger and, in this trigger, log the results of "DBMS_UTILITY.FORMAT_CALL_STACK" in a dedicated table.
The purpose of this function is exactly to give you the complete call stack of all the stored procedures and triggers that have been fired to reach your code.
I am writing from the mobile app, so I can't give you more detailed examples, but if you Google it you'll find many of them.
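As a minimal sketch of that idea (the log table, the trigger name, and my_monitored_table are placeholders):

CREATE TABLE call_stack_log
( logged_at   DATE
, call_stack  VARCHAR2(2000)
);

CREATE OR REPLACE TRIGGER trg_log_call_stack
AFTER UPDATE ON my_monitored_table        -- placeholder table name
BEGIN
  -- FORMAT_CALL_STACK lists every stored procedure/trigger on the way to this point.
  INSERT INTO call_stack_log (logged_at, call_stack)
  VALUES (SYSDATE, SUBSTR(DBMS_UTILITY.FORMAT_CALL_STACK, 1, 2000));
END;
/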
A quick and dirty option if you're working locally, and are only interested in the first thing that's altering the data, is to throw an error in the trigger instead of logging. That way, you get the usual stack trace and it's a lot less typing and you don't need to create a new table:
CREATE OR REPLACE TRIGGER who_changed_it   -- any trigger name will do
AFTER UPDATE ON table_of_interest
BEGIN
  RAISE_APPLICATION_ERROR(-20001, 'something changed it');
END;
/