I have a big collection of about 1 million documents in MarkLogic (version 9), with the structure below:
<File>
<Id></Id>
<ModifiedAt></ModifiedAt>
<Author></Author>
<Title></Title>
</File>
I need to iterate through the entire collection and replace the space in ModifiedAt with a T for all documents.
Example of document:
<File>
<Id>12121</Id>
<ModifiedAt>2011-06-08 14:29:29.000</ModifiedAt>
<Author>Test</Author>
<Title>Test</Title>
</File>
The ModifiedAt field should become: 2011-06-08T14:29:29.000
The code looks like:
for $doc in fn:collection('File')
return xdmp:node-replace($doc/File/ModifiedAt,
  <ModifiedAt>{fn:replace($doc/File/ModifiedAt, ' ', 'T')}</ModifiedAt>)
The issue is that this code times out.
I assume there is a more elegant way to make this modification on the entire collection, and maybe someone has a hint.
Thank you!
There are various external tools out there, like Corb2 and MLCP, that can be used for this, but you can also do ad hoc or less ad hoc work from inside MarkLogic. Essentially, all you need to do is process the documents in batches. Taskbot is very useful for that:
https://github.com/mblakele/taskbot
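If you would rather do the fix ad hoc from Query Console without taskbot, a minimal batching sketch could look like the one below. It assumes the documents live in a collection named 'File', that the URI lexicon is enabled, and a batch size of 1000; the batch size and the spawn options are illustrative only.
xquery version "1.0-ml";
let $uris := cts:uris((), (), cts:collection-query('File'))
let $batch-size := 1000
for $batch in 1 to xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
let $batch-uris := $uris[(($batch - 1) * $batch-size + 1) to ($batch * $batch-size)]
return
  xdmp:spawn-function(function() {
    for $uri in $batch-uris
    let $m := fn:doc($uri)/File/ModifiedAt
    return xdmp:node-replace($m, <ModifiedAt>{fn:replace($m, ' ', 'T')}</ModifiedAt>)
  }, <options xmlns="xdmp:eval"><update>true</update></options>)
Each spawned task updates one batch on the task server, so no single transaction has to touch all 1 million documents.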
HTH!
Related
I have this xquery as follows:
declare variable $i := doc()/some-element/modifier[empty(modifier-value)];
$i[1]/../..;
I need to run this query in MarkLogic's Query Console, where we have 721170811 records. Since that is a huge number of records, I am getting a timeout error. Is there any way I can optimize this query to get the result?
P.S. I cannot ask the admin to increase the timeout.
Try creating an element range index (or a path range index if the target element is not unique) and using a cts:values() lexicon lookup.
That way, the request can read the values from the range index instead of having to read each document.
See:
http://docs.marklogic.com/guide/search-dev/lexicon
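For example, a hedged sketch, assuming you create an element range index of type string on modifier-value (the limit option is illustrative):
xquery version "1.0-ml";
(: reads distinct modifier-value values straight from the range index :)
cts:values(
  cts:element-reference(xs:QName("modifier-value"), "type=string"),
  (),
  ("limit=1000")
)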
You could use xdmp:spawn: create a library module that runs the query, gets the documents, and iterates over the result collecting 1000 documents per iteration, calling another xdmp:spawn to process the information from that batch. I would suggest summarizing the result to return only the information you need, so you don't crash the browser. In the end it should look something like this:
xdmp:spawn("process.xqy")
Inside the library process.xqy:
declare function local:start-process() {
  let $docs := (....)
  let $temp :=
    for $x in $docs[$start to $end]   (: $start/$end bound the current batch :)
    return local:process-dataset($x)  (: could use xdmp:spawn here too if you want :)
  return xdmp:spawn("collect.xqy", $temp)  (: pass $temp to collect.xqy as an external variable :)
};
local:start-process()
The collect.xqy (compact-data) function should create a file or a set of files with your data. This way the server runs the whole process, and in a few minutes you will be able to see your data without problems.
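As a rough sketch of what collect.xqy could do (the $data external variable name and the output path are assumptions):
(: collect.xqy - hedged sketch :)
xquery version "1.0-ml";
declare variable $data external;  (: the summarized batch passed in by the caller :)
xdmp:save(
  fn:concat("/tmp/batch-", xs:string(xdmp:random()), ".xml"),
  document { <batch>{ $data }</batch> }
)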
You don't want to run something like doc() or xdmp:directory on its own - that just returns a result set that will kill you every time. You need to reduce your result set by a lot.
A few thoughts:
You want to have as much done on MarkLogic's d-nodes, and as little work done on the e-nodes, as possible. This is a way over-generalization, but for the most part I look at it like this: d-node stuff is data, indexes, lexicon work, etc., while e-node stuff handles XQuery and such. So, in your example, you're definitely working the e-node more than you need to.
You're going to want to use cts:search, as it uses indexes, not XPath, to resolve your query. So, something like this:
declare variable $i := cts:search(fn:collection(),
  cts:element-query(xs:QName("some-element"),
    cts:element-value-query(xs:QName("modifier"), "", "exact")
  )
)[1];
This will return document nodes, which looks like what you were after with $i[1]/../... It searches within some-element for a modifier that is empty.
Please create an element range index and an attribute range index and use cts:search. If you are familiar with MarkLogic, it will be easy for you to write the query.
Reading the docs at http://exist-db.org/exist/apps/doc/indexing.xml, I'm finding it difficult to understand how, and whether, I can improve the performance of a 'read' query (with 2 parameters: a string and an integer).
Does eXist-db have a default structural index? Can I improve a 2-parameter query with a 'range index'?
More details about my XML db (note there are 2 different dbs simply merged on the same root):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<db>
<docs>
<doc>
<header>
<year>2001</year>
<number>1</number>
<type>O</type>
</header>
<metas>
<meta>
<number>26001</number>
<details>
<detail>
<description>legge</description>
<number>19</number>
<date>14/01/1994</date>
</detail>
<detail>
<description>decreto legge</description>
<number>453</number>
<date>15/11/1993</date>
</detail>
</details>
</meta>
</metas>
</doc>
<doc>
<header>
<year>2001</year>
<number>2</number>
<type>O</type>
</header>
<metas>
<meta>
<number>26002</number>
<details>
<detail>
<description>decreto legislativo</description>
<number>29</number>
<date>03/02/1993</date>
</detail>
</details>
</meta>
<meta>
<number>26016</number>
<details>
<detail>
<description>decreto legislativo</description>
<number>29</number>
<date>03/02/1993</date>
</detail>
</details>
</meta>
</metas>
</doc>
</docs>
<full_text_docs>
<doc>
<header>
<year>2001</year>
<number>1</number>
<type>O</type>
<president>ferrari</president>
</header>
<text>lorem ipsum ...
</text>
</doc>
<doc>
<header>
<year>2001</year>
<number>2</number>
<type>O</type>
<president>ferrari</president>
</header>
<text>lorem ipsum......
</text>
</doc>
</full_text_docs>
</db>
This is my XQuery:
xquery version "3.0";
let $doc := doc("/db//index_test/test_general.xml")//db/docs/doc
let $fulltxt := doc("/db//index_test/test_general.xml")//db/full_text_docs/doc
return <root> {
for $a in $doc[metas/meta/details/detail[date="03/02/1993" and number = "29"]]/header
return $fulltxt[header/year/text()=$a/year/text() and
header/number/text()=$a/number/text() and
header/type/text()=$a/type/text()
]
} </root>
Basically I simply look for the detail/number and detail/date that match the input in the first db and use the results to query the second db. The results are all the documents under <full_text_docs> that match.
I would like to know if I can create indexes for the number and date fields to improve performance. Note this is the ONLY query I need to optimize (the only one I run on this db); obviously number and date change :).
SOLUTION:
For a clear explanation, read joewiz's answer. My problem was getting the .xconf file recognized correctly. It has to be placed in /db/yourcollectiondir. If you're using eXide, when you create the file you should select XML type with the template "eXist-db collection configuration". When you try to save the file you will see the prompt "Apply configuration?"; click 'OK'. Only then run this XQuery: xmldb:reindex('/db/yourcollectiondir').
Now, if everything is right, when you run an XQuery involving an index you will see its usage in "Monitoring and profiling".
As that documentation page states, eXist does create a structural index for all XML stored in the database. This is not an index of values, though, so without further indexes, queries based on value (rather than structure) would involve a lookup of values in the DOM. As your data grows larger, looking up values in the DOM gets slower and slower. This is where value-based indexes, such as the range index, save the day. (For a fuller explanation, see the "Indexing" section of Wolfgang Meier's "Tuning the Database" article, which is essential for getting the most performance out of eXist.)
So, yes, you can create indexes for the <number> and <date> fields. I'd recommend the "new range" index, as described on that documentation page. Your collection.xconf file setting up these indexes would look like this:
<collection xmlns="http://exist-db.org/collection-config/1.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<index>
<range>
<create qname="number" type="xs:integer"/>
<create qname="date" type="xs:string"/>
</range>
</index>
</collection>
You have to store this within the /db/system/config/ collection, in a subcollection corresponding to the location of your data in the database. So if your data is located in /db/apps/myapp/data, you would place this collection.xconf file in /db/system/config/db/apps/myapp/data.
Note that the configuration here would only affect the for clause's queries on date and number values, and not the predicates in the return clause, which depend on the values of the <year> and <type> elements. So, to ensure your query maximizes the use of indexes, you should declare indexes on these as well; xs:integer seems appropriate for <year>, and xs:string for <type>.
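For example, the <range> section of the collection.xconf above could be extended along these lines (a sketch; adjust the types to match your data):
<range>
    <create qname="number" type="xs:integer"/>
    <create qname="date" type="xs:string"/>
    <create qname="year" type="xs:integer"/>
    <create qname="type" type="xs:string"/>
</range>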
Lastly, I would suggest eliminating the /text() steps, which are completely extraneous. For more on the use/abuse of text(), see Evan Lenz's article, "text() is a code smell".
Update (2016-07-17): With the updated code sample above, I have a couple of additional suggestions. First, since the code is in /db/index_test, the data file and the collection.xconf need to be stored in the locations that correspond to that collection.
Assuming you're using eXide, when you store the collection.xconf file in a collection, eXide will prompt you to have a copy of the file placed in the correct location in /db/system/config. If you're not using eXide, you need to store the collection.xconf file there yourself.
Using the unmodified query, I can confirm that despite the presence of the collection.xconf file, monex shows no indexes are being applied.
Let's make a few modifications to the query to ensure indexes are properly applied:
xquery version "3.0";
<root> {
for $a in doc("/db/index_test/test_general.xml")//detail[date = "03/02/1993" and number = 29]/ancestor::doc/header
return
doc("/db/index_test/test_general.xml")/db/full_text_docs/doc
[
header/year = $a/year and
header/number = $a/number and
header/type = $a/type
]
} </root>
With these modifications, monex shows that indexes are applied to the comparisons in the for clause.
The insights here are derived from the "Tuning the Database" article. To get full indexing for all comparisons, you will need to define additional indexes and may need to make similar modifications to your query.
One final note: the version of monex you see in these pictures is using a feature I added this weekend, called "Tare", which tries to filter out other operations from the query profiling results in order to help the user see just the effects of their own query. This feature is still just a pull request, so running the current release version, you won't see identical results.
I wrote an XQuery expression that has a large result of about 50MB and takes a couple of hours to compute. I execute it in the BaseX GUI, but this is a little inconvenient: it crops the result to a result window, which I then have to save. At this time, BaseX becomes unresponsive and may crash.
Is there a way to directly write the result to a file?
Have a look at BaseX's File Module, which provides broad functionality for reading and writing files and traversing the file system.
For you, file:write($path as xs:string, $items as item()*) as empty-sequence() will be of special interest, as it allows you to write an element sequence to a file. For example:
file:write(
'/tmp/output.xml',
<root>{
for $i in 1 to 1000000
return <some-large-amount-of-data />
}</root>
)
If your output isn't well-formed XML, consider the file:write-binary, file:write-text and file:write-text-lines functions.
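For instance, a hedged sketch of writing plain-text output line by line (the path and the generated content are placeholders):
file:write-text-lines(
  '/tmp/output.txt',
  for $i in 1 to 1000000
  return 'line ' || string($i)
)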
Yet another alternative might be writing to documents in the database instead of files. db:add and db:create from the database module can be used to add the computed results to the current or a new database.
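A hedged sketch of that alternative, where the database name and document path are placeholders:
db:create(
  'results',
  <root>{
    for $i in 1 to 1000000
    return <some-large-amount-of-data/>
  }</root>,
  'output.xml'
)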
I have only just started with MarkLogic and XQuery. I am having a really tough time modifying the content of one of my XML documents; I just cannot seem to get a change to an element to stick. Here's my process (I have had to take things back as basic as I could just to try and get it working):
In query console I have one tab open which queries for the contents of one XML doc:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
xdmp:document-get("C:/Users/Paul/Documents/MarkLogic/xml/ppl/ppl/jdbc_ppl_3790.xml")
This brings back the document, abbreviated here (only the relevant element shown):
<document>
  <meta>
    ...
    <ppl_name>Victoria Wilson</ppl_name>
  </meta>
</document>
I now want to update the element using XQuery but it's just not happening. Here's the XQuery:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
let $docxml :=
xdmp:document-get("C:/Users/Paul/Documents/MarkLogic/xml/ppl/ppl/jdbc_ppl_3065.xml")/document/meta/ppl_name
return
for $node in $docxml/*
let $target := xdmp:document-get("C:/Users/Paul/Documents/MarkLogic/xml/ppl/ppl/jdbc_ppl_3790.xml")/document/meta/*[fn:name() = fn:name($node)]
return
xdmp:node-replace($target, $node)
I am basically looking to replace the ppl_name element in the target (3790) with the ppl_name element from the source (3065).
I run the XQuery - it completes without error (making me think it has worked) - but the return value reads "your query returned an empty sequence".
I then go back to the same tab as I used in step 1 and re-run the XQuery used in step 1. The doc (3790) comes back but it STILL has Victoria Wilson as the ppl_name.
The node returned by xdmp:document-get is an in-memory node from a document on the filesystem. It isn't coming from the database. You can't use xdmp:node-replace on in-memory nodes. That's only for database-resident nodes.
You can insert it using xdmp:document-insert. Then it's in the database, and you can access it using doc and update it using xdmp:node-replace. Or you can use in-memory operations to construct a new version with the changes you want.
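A hedged sketch of that first route (the database URI is a placeholder); note that the insert and the later node-replace have to run in separate transactions, for example in separate Query Console requests:
xquery version "1.0-ml";
(: load the filesystem document into the database :)
xdmp:document-insert(
  "/ppl/jdbc_ppl_3790.xml",  (: hypothetical database URI :)
  xdmp:document-get("C:/Users/Paul/Documents/MarkLogic/xml/ppl/ppl/jdbc_ppl_3790.xml")
)
(: in a later request, the database copy can then be updated, e.g.:
   xdmp:node-replace(fn:doc("/ppl/jdbc_ppl_3790.xml")/document/meta/ppl_name, $new-name) :)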
See What are in memory elements in marklogic? for previous answers to a similar question, and more tips.
Here, the node returned by xdmp:document-get is an in-memory node.
If you're working with in-memory elements, import the following module:
import module namespace mem = "http://xqdev.com/in-mem-update" at "/MarkLogic/appservices/utils/in-mem-update.xqy";
Instead of using xdmp:node-replace you can use mem:node-replace(<x/>, <y/>)
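A hedged sketch of using it here (the replacement element is a placeholder); mem:node-replace returns a modified in-memory copy rather than updating the database:
xquery version "1.0-ml";
import module namespace mem = "http://xqdev.com/in-mem-update"
  at "/MarkLogic/appservices/utils/in-mem-update.xqy";

let $doc := xdmp:document-get("C:/Users/Paul/Documents/MarkLogic/xml/ppl/ppl/jdbc_ppl_3790.xml")
let $new := <ppl_name>New Name</ppl_name>  (: placeholder replacement element :)
return mem:node-replace($doc/document/meta/ppl_name, $new)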
I am using MarkLogic 4 and I have some 15000 documents (each around 10 KB). I want to load the entire content as a document (and convert all the documents to a single CSV file and write it to the HTTP output stream for downloading). I load the documents this way:
let $uri := cts:uri-match('products/documents/*.xml')
let $doc := fn:doc($uri)
The XPath matches some 15000 XML files, so fn:doc throws an XDMP-EXPNTREECACHEFULL error.
Is there any workaround for this? I cannot increase the expanded tree cache size in the Admin Console because the number of XML files in products/documents/*.xml may increase.
Thanks.
When you want to export large quantities of XML from MarkLogic, the best technique is to write the query so that results can stream, avoiding the expanded tree cache entirely. It is a very different style of coding, though: you'll have to avoid strong typing of any kind, and refactor your code to remove FLWOR expressions. You won't be able to test any of the code in cq or qconsole, either.
Take a look at http://blakeley.com/blogofile/2012/03/19/let-free-style-and-streaming/ for some tips on how to get there. At a minimum the code sample you posted would have to become:
doc(cts:uri-match('products/documents/*.xml'))
In passing I would try to rework that to avoid the *.xml part, because it will be slower than needed. Maybe something like this?
cts:search(
  collection(),
  cts:directory-query('products/documents/', 'infinity'))
If you need to test for something more than the directory, you could add a cts:and-query with some cts:element-query test.
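For example, a hedged sketch, where 'product' stands in for whatever element you actually need to test:
cts:search(
  collection(),
  cts:and-query((
    cts:directory-query('products/documents/', 'infinity'),
    cts:element-query(xs:QName('product'), cts:and-query(()))
  )))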
For general information about this error, see the MarkLogic knowledge base article on XDMP-EXPNTREECACHEFULL