I have the following XQuery:
declare variable $i := doc()/some-element/modifier[empty(modifier-value)];
$i[1]/../..;
I need to run this query in MarkLogic's Query Console (qconsole), where we have 721170811 records. Since that is a huge number of records, I am getting a timeout error. Is there any way I can optimize this query to get the result?
P.S. I cannot ask the admin to increase the timeout.
Try creating an element range index (or a path range index if the target element is not unique) and using a cts:values() lexicon lookup; see the sketch after the link below.
That way, the request can read the values from the range index instead of having to read each document.
See:
http://docs.marklogic.com/guide/search-dev/lexicon
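For illustration, a minimal sketch of that lexicon lookup, assuming a string element range index has been created on modifier-value (the element name is taken from the question and may need adjusting for your schema):

xquery version "1.0-ml";
(: Reads the distinct modifier-value values straight from the range index,
   without loading any documents. Requires the range index to exist. :)
cts:values(
  cts:element-reference(xs:QName("modifier-value"))
)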
You could use xdmp:spawn: create a library module that runs the query, gets the documents, iterates over the result collecting 1000 documents per iteration, and calls another xdmp:spawn to process the information from that batch. I would suggest summarizing the result to return only the information you need, so you don't crash the browser. In the end it should look something like this:
xdmp:spawn("process.xqy")
In the library process.xqy:
declare function local:start-process($start as xs:int, $end as xs:int)
{
  let $docs := () (: ... the full document query goes here ... :)
  let $temp :=
    for $x in $docs[$start to $end]
    return local:process-dataset($x) (: could use xdmp:spawn here too if you want :)
  return xdmp:spawn("collect.xqy", (xs:QName("data"), $temp))
};

local:start-process(1, 1000)
The collecting step (collect.xqy above) should create a file or a set of files with your data; this way the server runs the whole process in the background, and in a few minutes you will be able to see your data without problems.
You don't want to run something like doc() or xdmp:directory on its own - that just returns a result set that will kill you every time. You need to lower your result set by a lot.
A few thoughts:
You want to have as much done in MarkLogic's d-node, and as little work done in the e-node, as possible. This is a way over-generalization, but for the most part I look at it like this: d-node stuff is data, indexes, lexicon work, etc.; e-node stuff handles XQuery and such. So, in your example, you're definitely working the e-node more than you need to.
You're going to want to use cts:search, as it uses indexes, not XPath, to resolve your query. So, something like this:
declare variable $i := cts:search(
  fn:collection(),
  cts:element-query(
    xs:QName("some-element"),
    cts:element-value-query(xs:QName("modifier"), "", "exact")
  )
)[1];
This will return document nodes, which looks like what you were wanting with $i[1]/../... It searches within some-element elements for a modifier whose value is empty.
Create an element range index and an attribute range index and use cts:search. If you are familiar with MarkLogic, it will be easy for you to write the query.
Hi,
I am trying to use the "cover" function from APOC like this:
WITH ["f1,"f2",...] as list1
MATCH (n:Frag)
WHERE n.frag in list1
WITH COLLECT(ID(n)) as nodeIds
CALL apoc.algo.cover(nodeIds)
YIELD rel
RETURN rel
It works, but it is very slow the first time. If I run it again, it becomes much quicker. What does that mean?
Your issue is probably not related to the apoc.algo.cover usage, but to the WHERE part of your query. You can try improving performance by adding an index on the Frag.frag property:
CREATE INDEX ON :Frag(frag)
After creating the index, run your query again. Note that the index is not immediately available; it is created in the background.
Two questions:
How can I check whether a file exists before EXTRACT?
We have a scenario where a new input file is generated every day for catalog data. We need to merge the new input with the d-1 (previous day's) file. Before the merge, we want to make sure that the new input file exists at the source location.
Does U-SQL support a try...catch block?
Regarding checking whether a file exists: we recently released a compile-time IF statement that can indeed check for partition existence (other objects, such as files and tables, are on the roadmap).
Once that feature is released (still one or two refreshes out at the time of this answer), it may look something like this (syntax subject to change):
IF FILE.EXISTS("/mydir/myfile.csv") THEN
    @data = EXTRACT ... FROM "/mydir/myfile.csv" USING ...;
    ...
    @jobstate = SELECT * FROM (VALUES("job completed")) AS T(status);
ELSE
    @jobstate = SELECT * FROM (VALUES("file not ready. Job not executed.")) AS T(status);
END;
OUTPUT @jobstate TO "/jobs/myjobstate.csv" USING Outputters.Csv();
You will be able to provide the name as a parameter as well. Please let me know if that will work for your scenario.
Another alternative is to use the file set syntax, especially if you want to use a dynamic value to determine the processing. That would simply create an empty rowset:
@data = EXTRACT ..., date DateTime
        FROM "/mydir/{date:yyyy}/{date:MM}/{date:dd}/data.csv"
        USING ...;
@data = SELECT * FROM @data WHERE date == DateTime.Now.AddDays(-1);
... // continue processing @data, which is empty if yesterday's file is not there yet
Having said that, your job orchestration framework (such as ADF) may be a better place to check for file existence before submitting the job in the first place.
As to the try...catch block: U-SQL itself is a declarative language that is optimized at the script level, where the plan gets generated and optimized at runtime over the whole script. Thus a dynamic TRY-CATCH is currently not available, since it would severely impact the ability to optimize the script (e.g., you cannot move predicates or column pruning outside of a try-catch block). Also, TRY/CATCH can lead to code that is very hard to understand and debug, especially if it is used to mimic procedural workflows in an otherwise declarative environment.
However, you can use try/catch inside your C# functions without problems if you need to catch C# runtime errors.
FILE.EXISTS() always returns True when executed locally. However, it works when executing against Azure Data Lake.
I tried the MSDN example, and the following returns True, True:
DECLARE @filepath_good = "/Samples/Data/SearchLog.tsv";
DECLARE @filepath_bad = "/Samples/Data/zzz.tsv";

@result =
    SELECT FILE.EXISTS(@filepath_good) AS exists_good,
           FILE.EXISTS(@filepath_bad) AS exists_bad
    FROM (VALUES (1)) AS T(dummy);

OUTPUT @result
TO "/Output/FileExists.txt"
USING Outputters.Csv();
I have Microsoft Azure Data Lake Tools for Visual Studio version 2.2.5000.0
I tried to update embedded triples in MarkLogic using XQuery, but it does not seem to work for embedded triples, although the same query works for other triples.
Can you tell me if there is some other option that needs to be specified when performing an update on embedded triples?
The code I used is:
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

let $triples := cts:triples(sem:iri("http://smartlogic.com/document#2012-10-26_DNB.OL_(Citi)_DNB_ASA_(DNB.OL)__Model_Update.61259187.xml"), (), ())
for $triple in $triples
let $node := sem:database-nodes($triple)
let $replace :=
  <sem:triple>
    <sem:subject>http://www.example.com/products/1001_Test</sem:subject>
    {$node/sem:predicate, $node/sem:object}
  </sem:triple>
return $node ! xdmp:node-replace(., $replace)
My document contains the following triple
<sem:triples xmlns:sem="http://marklogic.com/semantics">
<sem:triple>
<sem:subject>http://smartlogic.com/document#2012-10-26_DNB.OL_(Citi)_DNB_ASA_(DNB.OL)__Model_Update.61259187.xml</sem:subject>
<sem:predicate>http://www.smartlogic.com/schemas/docinfo.rdf#cik</sem:predicate>
<sem:object>datatype="http://www.w3.org/2001/XMLSchema#string</sem:object>
</sem:triple>
</sem:triples>
and I want this particular subject to change into something like this:
<sem:subject>http://www.example.com/products/1001_Test</sem:subject>
But when I use the XQuery to update it, it does not alter anything; the embedded triple in the document remains the same.
When I checked whether any of the results had changed to the subject I specified, I got no results.
I used the following SPARQL query to test:
SELECT *
WHERE {
<http://www.example.com/products/1001_Test> ?predicate ?object
}
You need to add the option 'all' when you ask for the database nodes backing the triple: sem:database-nodes($triple, 'all').
To be perfectly honest, I am not 100% sure why, but I think this is because your sem:triples element is not the root element of the document it appears in.
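For illustration, here is the question's code with that one change applied (a sketch only; the IRIs and element names are as in the question):

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

let $triples := cts:triples(sem:iri("http://smartlogic.com/document#2012-10-26_DNB.OL_(Citi)_DNB_ASA_(DNB.OL)__Model_Update.61259187.xml"), (), ())
for $triple in $triples
(: the 'all' option also returns the nodes backing embedded triples :)
let $node := sem:database-nodes($triple, 'all')
let $replace :=
  <sem:triple>
    <sem:subject>http://www.example.com/products/1001_Test</sem:subject>
    {$node/sem:predicate, $node/sem:object}
  </sem:triple>
return $node ! xdmp:node-replace(., $replace)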
I am using MarkLogic 4 and I have some 15,000 documents (each around 10 KB). I want to load the entire content (and convert all of the documents into a single CSV file, written to the HTTP output stream for downloading). I load the documents this way:
let $uri := cts:uri-match('products/documents/*.xml')
let $doc := fn:doc($uri)
That URI match returns some 15,000 XML documents, so fn:doc throws an XDMP-EXPNTREECACHEFULL error.
Is there any workaround for this? I cannot increase the tree cache size in the Admin Console because the number of XML files in products/documents/*.xml may increase.
Thanks.
When you want to export large quantities of XML from MarkLogic, the best technique is to write the query so that results can stream, avoiding the expanded tree cache entirely. It is a very different style of coding, though: you'll have to avoid strong typing of any kind, and refactor your code to remove FLWOR expressions. You won't be able to test any of the code in cq or qconsole, either.
Take a look at http://blakeley.com/blogofile/2012/03/19/let-free-style-and-streaming/ for some tips on how to get there. At a minimum the code sample you posted would have to become:
doc(cts:uri-match('products/documents/*.xml'))
In passing I would try to rework that to avoid the *.xml part, because it will be slower than needed. Maybe something like this?
cts:search(
collection(),
cts:directory-query('products/documents/', 'infinity'))
If you need to test for something more than the directory, you could add a cts:and-query with some cts:element-query test.
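For illustration, a sketch of that combination; the element name "product" is hypothetical and stands in for whatever element you need to test for:

cts:search(
  collection(),
  cts:and-query((
    cts:directory-query('products/documents/', 'infinity'),
    (: cts:and-query(()) as the inner query matches any product element :)
    cts:element-query(xs:QName("product"), cts:and-query(()))
  )))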
For general information about this error, see the MarkLogic knowledge base article on XDMP-EXPNTREECACHEFULL
I am using the following XQuery to query a collection of files:
for $files in collection("/data?select=*data*.xml")
Each file in the directory has a specific name, which enables me to recognize it. I use this as the identifier, which I retrieve as follows:
let $file-id := tokenize(base-uri($files), "/")[last()]
The $file-id variable follows a certain pattern: abc-1234. The first eight characters are relevant, so I fetch them using the variable below:
let $file-link-id := substring($file-id, 1, 8)
Now, I have another collection of files, which I want to query. These files follow the same pattern in the name, because they contain connected information.
How can I use the $file-link-id to select the correct file in the second collection?
I assume I would have to include it in the second collection clause, something along the lines of ?select=$file-link-id.xml, but I am unsure of how to do this.
Maybe you could clarify your problem statement if I am assuming wrong here, because your problem seems easily solvable (so maybe I misunderstood it).
So you have your correct $file-link-id and want to use it as a string. If your XQuery processor supports XQuery 3.0, you can use the string concatenation operator (two pipes), i.e.:
for $files in collection("/data?select=" || $file-link-id || ".xml")
If not, use string-join():
for $files in collection(string-join(("/data?select=", $file-link-id, ".xml"), ''))