avoiding XDMP-EXPNTREECACHEFULL and loading document - xquery

I am using MarkLogic 4 and I have some 15,000 documents (each around 10 KB). I want to load all of the documents, convert them into a single CSV file, and write it to the HTTP output stream for download. I load the documents this way:
let $uri := cts:uri-match('products/documents/*.xml')
let $doc := fn:doc($uri)
The pattern matches some 15,000 XML files, so fn:doc throws an XDMP-EXPNTREECACHEFULL error.
Is there any workaround for this? I cannot increase the expanded tree cache size in the Admin console, because the number of XML files under products/documents/ may keep growing.
Thanks.

When you want to export large quantities of XML from MarkLogic, the best technique is to write the query so that results can stream, avoiding the expanded tree cache entirely. It is a very different style of coding, though: you'll have to avoid strong typing of any kind, and refactor your code to remove FLWOR expressions. You won't be able to test any of the code in cq or qconsole, either.
Take a look at http://blakeley.com/blogofile/2012/03/19/let-free-style-and-streaming/ for some tips on how to get there. At a minimum the code sample you posted would have to become:
doc(cts:uri-match('products/documents/*.xml'))
In passing I would try to rework that to avoid the *.xml part, because it will be slower than needed. Maybe something like this?
cts:search(
  collection(),
  cts:directory-query('products/documents/', 'infinity'))
If you need to test for something more than the directory, you could add a cts:and-query with some cts:element-query test.
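For example, a hedged sketch of that combined query might look like the following; the element name product is only a placeholder for whatever marks your documents:
cts:search(
  collection(),
  cts:and-query((
    cts:directory-query('products/documents/', 'infinity'),
    cts:element-query(xs:QName('product'), cts:and-query(()))
  )))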

For general information about this error, see the MarkLogic knowledge base article on XDMP-EXPNTREECACHEFULL


effectively accessing first item in object

As input, consider a DB dump (from DBeaver) having this format:
{
  "select": [
    {<row1>},
    {<row2>}
  ],
  "select": {}
}
Say I'm debugging a bigger script and just want to see the first few rows from the first statement. How can I do that efficiently in a rather huge file?
Template:
jq 'keys[0] as $k|.[$k]|limit(1;.[])' dump
isn't really great, as it needs to fetch all keys first. The template
jq '.[0]|limit(1;.[])' dump
sadly does not seem to be a valid one, and
jq 'first(.[])|limit(1;.[])' dump
does not seem to have any performance benefit.
What would be the best way to access just the first field in the object, without actually testing its name or caring about the rest of the fields?
One strategy would be to use the --stream command-line option. It's a bit tricky to use, but if you want to use jq or gojq, it's the way to go for a space-time efficient solution for a large input.
Far easier to use would be my jm script, which is intended precisely to achieve the kind of objective you describe. In particular, please note its --limit option. E.g. you could start with:
jm -s --limit 1
See
https://github.com/pkoppstein/jm
How to read a 100+GB file with jq without running out of memory
Given that weird object with identical keys, you can use the --stream option to access all items before the JSON processor would eliminate the duplicates, fromstream and truncate_stream to dissect the input, and limit to reduce the output to just a few items:
jq --stream -cn 'limit(5; fromstream(2|truncate_stream(inputs)))' dump.json
{<row1>}
{<row2>}
{<row3>}
{<row4>}
{<row5>}

is there a way to read the contents of the last jupyter markdown cell as a string?

I'm using Jupyter and pandas read_sql; this works fine but looks ugly.
For instance, I have a query:
SELECT *
FROM table_a AS a
LIMIT 10;
I could show it nicely in a markdown cell like so:
``` mysql
SELECT *
FROM table_a AS a
LIMIT 10;
```
and I could execute it in a code cell like so:
pd.read_sql('SELECT * FROM table_a AS a LIMIT 10;', conn)
This involves copy/paste and displays the query twice (not great if I want to simply export my notebook to a PDF report).
Is there a way to avoid the duplication by reading the markdown text into a Python string variable, or any other way?
The cell magic answer cited by @Micah Kornfield in the question comments may be a good fit for many situations. The question, however, says it is desirable to avoid duplication. Let's imagine that the SQL is huge and we don't want to see the same query more than once.
Unfortunately, right now in 2021 there is no easy solution for this. In a Jupyter notebook there are two worlds: the backend, which is the kernel and in our case runs Python, and the frontend, which runs JavaScript. Only JavaScript sees the markdown cells. It is possible to make the backend and frontend communicate with each other; those methods are usually a little hacky, but we will rely on some of them anyway.
I have written a script that does the job in two different ways, which should bring similar results. I will call these the file read method and the JavaScript method.
First, save the following file markdown.py in the same folder as the notebook (we are using a separate file because you specified that your notebook will eventually go into a report, and it is undesirable to have this script inside the notebook):
from IPython.display import Javascript
from urllib.parse import unquote
from json import loads as jsonloads

def markread(cellnumber, notebookname=None, callbackvar=None):
    try:
        if type(cellnumber) is int:  # maybe check if (varname in globals()):
            if callbackvar is not None and type(callbackvar) is str:
                return Javascript("const mdtjs = Jupyter.notebook.get_cells().filter(c=>c.cell_type==\"markdown\")["+str(cellnumber)+"].get_text(); IPython.notebook.kernel.execute(\"mdtp = unquote('\"+encodeURI(mdtjs)+\"');mdtp=mdtp[mdtp.find('\\\\n',mdtp.find('```'))+1:min(mdtp.rfind('\\\\n'),mdtp.rfind('```'))].strip();"+callbackvar+"=mdtp;del mdtp\");")
            if notebookname is not None and type(notebookname) is str:
                if not notebookname.endswith('.ipynb'):
                    notebookname += '.ipynb'
                with open(notebookname) as f:
                    j = jsonloads(f.read())
                mdts = [''.join(c['source'][1:]).strip().strip('`').strip() for c in j['cells'] if c['cell_type']=='markdown']
                return mdts[cellnumber]
    except:
        return None
    return None
Now back to the notebook, to load the script, you have to import it:
from markdown import markread, unquote
The unquote import is needed for the JavaScript method; otherwise you can skip it.
1. File read method:
Usage:
marktext = markread(2, notebookname='mynotebookname')
Here marktext will get the value of the third markdown cell in mynotebookname (third because we live in a zero-indexed world, so 2 means third; if you skip the '.ipynb' extension in notebookname, as in this case, it will be appended automatically). Important: this method reads the notebook file written on disk, not the current unsaved state. If you have changed anything since the last save, things may go wrong.
2. Javascript method:
Usage:
markread(1, callbackvar='marktext')
Here we write the value of our second markdown cell to a variable called marktext. The JavaScript method is trickier: it is async, so we have to pass the name of the variable we want written (it must be a string naming the variable, not the variable itself). It is also important to know that markread must be the last command in the cell, due to a limitation in how the JavaScript is invoked.
How it works
Internally, the file read method just reads the notebook file (which is JSON), picks the values from 'cells', and keeps only the cells that are markdown.
The JavaScript method is more complex. It invokes JS because JS has access to all cells, including markdown, so JS reads the cell values (via Jupyter.notebook.get_cells), filters the markdown ones, invokes Python back, and sends those markdown cells back URL-encoded. The encoded cells are decoded and assigned to the callbackvar. In both methods I made some assumptions, which may not be correct, about trimming the start and the end of the cell value (the ``` and whitespace).
There are ways to improve the code, for example making it auto-detect the notebook name for the file read method, but that involves even more hacks: relying again on JavaScript to get the notebook name, or making a call to the API on port 8888 and then having to deal with the session password. I believe the most important part is already covered by the script. If one method does not work, you will probably still have the other.

Marklogic - Delete Versioned Collections

I have around 43 million documents; the latest version of each document is in the LIVE collection, and the same versioned documents are also in version collections named /collection/versionNumber. I want to delete the versioned collections, which hold around 34 million documents. What is the best approach to delete them all in one go?
You could try using xdmp:collection-delete() to delete all documents in the collection in a single transaction.
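For example, using the collection URI from the question:
xdmp:collection-delete("/collection/versionNumber")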
If that doesn't work and it isn't able to delete in one shot, then I would look to utilize batch tools. For instance, a CoRB job.
An example job options file with properties needed, except for the XCC-CONNECTION-URI:
# Inline module to select all URIs from the collection
URIS-MODULE=INLINE-XQUERY|let $uris := cts:uris("",(),cts:collection-query("/collection/versionNumber")) return (count($uris), $uris)
# Inline module to delete the docs
PROCESS-MODULE=INLINE-XQUERY|declare variable $URI as xs:string external; xdmp:document-delete($URI)
THREAD-COUNT=10
I think your application is using the DLS library for versioning. If so, and if you will never need to look at any of the old versions again, then delete only the versioned documents. You can use the dls:document-unmanage API in that case.
Also, explore dls:purge and dls:document-purge before proceeding; I am not very sure of these two.
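As a rough sketch, assuming the documents really are DLS-managed (the URI below is just a placeholder), unmanaging a single document looks like this:
import module namespace dls = "http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";

dls:document-unmanage("/products/documents/example.xml")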
Anyway, even if it's not DLS, processing them in one go (a single transaction) is not the recommended way. Either process them in batches or spread them across the task server in separate threads via spawn, as sketched below.
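A minimal batching sketch along those lines, again assuming the collection URI from the question (the batch size of 1000 and the use of xdmp:spawn-function are illustrative choices, not a tuned recipe):
let $uris := cts:uris("", (), cts:collection-query("/collection/versionNumber"))
let $batch-size := 1000
for $i in 1 to xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
let $batch := fn:subsequence($uris, ($i - 1) * $batch-size + 1, $batch-size)
return
  xdmp:spawn-function(
    function() { $batch ! xdmp:document-delete(.) },
    <options xmlns="xdmp:eval"><update>true</update></options>)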

Xquery optimization

I have this xquery as follows:
declare variable $i := doc()/some-element/modifier[empty(modifier-value)];
$i[1]/../..;
I need to run this query in MarkLogic's Query Console, where we have 721,170,811 records. Since that is a huge number of records, I am getting a timeout error. Is there any way I can optimize this query to get the result?
P.S. I cannot ask the admin to increase the timeout.
Try creating an element range index (or a path range index if the target element is not unique) and using a cts:values() lexicon lookup.
That way, the request can read the values from the range index instead of having to read each document.
See:
http://docs.marklogic.com/guide/search-dev/lexicon
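As a rough sketch, assuming an element range index on modifier-value has been created, a lexicon lookup could look something like this; it reads the distinct modifier-value values straight from the index rather than touching each document (isolating the empty values depends on how emptiness is represented in your data):
cts:values(
  cts:element-reference(xs:QName("modifier-value")),
  (), (),
  cts:element-query(xs:QName("some-element"), cts:and-query(())))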
You could use xdmp:spawn: create a library module that runs the query, gets the documents, and iterates over the result collecting 1000 documents per iteration, calling another xdmp:spawn to process that dataset. I would suggest summarizing the result to return only the information you need, so you don't crash the browser. In the end it should look something like this:
xdmp:spawn("process.xqy")
and in the library process.xqy:
declare function local:start-process() {
  let $docs := (....)   (: the query that selects your documents :)
  let $batch :=
    for $x in $docs[$start to $end]   (: $start/$end bound the current batch :)
    return local:process-dataset($x)  (: could use xdmp:spawn here too if you want :)
  return xdmp:spawn("collect.xqy", $batch)
};
local:start-process()
The collect.xqy step should create a file, or a set of files, with your data. This way the server runs the whole process, and in a few minutes you will be able to look at your data without problems.
You don't want to run something like doc() or xdmp:directory() - that just returns a result set that will kill you every time. You need to lower your result set by a lot.
A few thoughts:
You want to do as much as possible on MarkLogic's d-nodes and as little as possible on the e-nodes. This is a big over-generalization, but for the most part I look at it like this: d-node work is data, indexes, lexicon work, etc.; e-node work handles XQuery and such. So, in your example, you're definitely working the e-node more than you need to.
You're going to want to use cts:search, as it uses indexes, not XPath, to resolve your query. So, something like this:
declare variable $i := cts:search(
  fn:collection(),
  cts:element-query(xs:QName("some-element"),
    cts:element-value-query(xs:QName("modifier"), "", "exact")
  )
)[1];
This will return document nodes, which looks like what you wanted with the $i[1]/../... It searches for a some-element containing a modifier whose value is empty.
Please create an element range index and an attribute range index and use cts:search. If you are familiar with MarkLogic, it will be easy for you to write the query.

Write directly to file from BaseX GUI

I wrote an XQuery expression that produces a large result of about 50 MB and takes a couple of hours to compute. I execute it in the BaseX GUI, but this is a little inconvenient: it crops the result in the result window, which I then have to save. While doing so, BaseX becomes unresponsive and may crash.
Is there a way to directly write the result to a file?
Have a look at BaseX's File Module, which provides broad functionality to read from and write to files and to traverse the file system.
Of special interest for you will be file:write($path as xs:string, $items as item()*) as empty-sequence(), which allows you to write a sequence of items to a file. For example:
file:write(
  '/tmp/output.xml',
  <root>{
    for $i in 1 to 1000000
    return <some-large-amount-of-data />
  }</root>
)
If your output isn't well-formed XML, consider the file:write-binary, file:write-text and file:write-text-lines functions.
Yet another alternative might be writing to documents in the database instead of files. db:add and db:create from the database module can be used to add the computed results to the current or a new database.
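For instance, a minimal sketch, where the database name mydb and the target path results.xml are placeholders:
db:add(
  'mydb',
  <results>{
    for $i in 1 to 1000000
    return <some-large-amount-of-data />
  }</results>,
  'results.xml'
)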
