I wrote an XQuery expression that has a large result of about 50MB and takes a couple of hours to compute. I execute it in the BaseX GUI, but this is a little inconvenient: it crops the result to a result window, which I then have to save. At this time, BaseX becomes unresponsive and may crash.
Is there a way to directly write the result to a file?
Have a look at BaseX's File Module, which provides broad functionality for reading and writing files and traversing the file system.
For you, file:write($path as xs:string, $items as item()*) as empty-sequence() will be of special interest: it writes a sequence of items to a file. For example:
file:write(
  '/tmp/output.xml',
  <root>{
    for $i in 1 to 1000000
    return <some-large-amount-of-data />
  }</root>
)
If your output isn't well-formed XML, consider the file:write-binary, file:write-text and file:write-text-lines functions.
Yet another alternative might be writing to documents in the database instead of files. db:add and db:create from the database module can be used to add the computed results to the current or a new database.
I'm using Jupyter and pandas read_sql. This works fine but looks ugly.
For instance, I have a query:
SELECT *
FROM table_a AS a
LIMIT 10;
I could show it nicely in a markdown cell as so:
``` mysql
SELECT *
FROM table_a AS a
LIMIT 10;
```
and I could execute it in a code cell like so:
pd.read_sql('SELECT * FROM table_a AS a LIMIT 10;', conn)
This involves copy/paste and displaying the query twice (not ideal if I simply want to export my notebook to a PDF report).
Is there a way to avoid the duplication by reading the markdown text into a Python string variable, or any other way?
The cell magic answer cited by @Micah Kornfield in the question comments may be a good fit for many situations. However, the question says it is desirable to avoid duplicates. Let's imagine that the SQL is huge and we don't want to see the same query more than once.
Unfortunately, right now in 2021 there's no easy solution for this. In a Jupyter notebook there are two worlds: the backend, which is the kernel and in our case runs Python, and the frontend, which runs JavaScript. Only JavaScript sees the markdown cells. It is possible to make the backend and frontend communicate with each other. Those methods are usually a little hacky, but we will rely on some of them anyway.
I have written a script that does our job in two different ways, which will probably bring similar results. I will call those methods the file read method and the javascript method.
First, please save the following file as markdown.py in the same folder as the notebook (we are using a separate file because you specified that your notebook will eventually go into a report, and it is undesirable to have this script together with the notebook):
from IPython.display import Javascript
from urllib.parse import unquote
from json import loads as jsonloads

def markread(cellnumber,notebookname=None,callbackvar=None):
    try:
        if type(cellnumber) is int:# maybe check if (varname in globals()):
            if callbackvar is not None and type(callbackvar) is str:
                return Javascript("const mdtjs = Jupyter.notebook.get_cells().filter(c=>c.cell_type==\"markdown\")["+str(cellnumber)+"].get_text(); IPython.notebook.kernel.execute(\"mdtp = unquote('\"+encodeURI(mdtjs)+\"');mdtp=mdtp[mdtp.find('\\\\n',mdtp.find('```'))+1:min(mdtp.rfind('\\\\n'),mdtp.rfind('```'))].strip();"+callbackvar+"=mdtp;del mdtp\");")
            if notebookname is not None and type (notebookname) is str:
                if not notebookname.endswith('.ipynb'):
                    notebookname += '.ipynb'
                with open(notebookname) as f:
                    j = jsonloads(f.read())
                    mdts = [''.join(c['source'][1:]).strip().strip('`').strip() for c in j['cells'] if c['cell_type']=='markdown']
                    return mdts[cellnumber]
    except:
        return None
    return None
Now back to the notebook, to load the script, you have to import it:
from markdown import markread, unquote
The unquote import is needed for the javascript method; otherwise you can skip it.
1. File read method:
Usage:
marktext = markread(2, notebookname='mynotebookname')
Here marktext will get the value of the third markdown cell in the notebook mynotebookname (third because we live in a zero-indexed world, so 2 means third; if you omit the '.ipynb' extension in notebookname, as in this case, it will be appended automatically). Important: this method reads the notebook file written on disk, not the live state of the notebook. If you have changed anything since the last save, things may go wrong.
2. Javascript method:
Usage:
markread(1, callbackvar='marktext')
Here we write the value of our second markdown cell to a variable called marktext. The javascript method is trickier: it is async, so we have to pass the name of the variable that we want to write to (it must be a string holding the variable's name, not the variable itself). It is also important to know that markread must be the last command in the cell, due to a limitation in how the JavaScript is invoked.
How it works
Internally, the file read method just reads the notebook file which is json, picks the value from 'cells' and filters out the ones which are markdown.
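The file read idea can be tried in isolation with a small standalone sketch. This is a simplified illustration, not the markdown.py script above: the helper name markdown_cell_text and the two-cell demo notebook are hypothetical, and it assumes the usual notebook JSON layout where each cell has a 'cell_type' and a 'source' that is either a list of lines or a single string. Unlike markread, it returns the cell text untrimmed.

```python
import json
import os
import tempfile

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable


def markdown_cell_text(notebook_path, index):
    """Return the raw text of the index-th markdown cell (zero-based)."""
    with open(notebook_path, encoding="utf-8") as f:
        notebook = json.load(f)
    texts = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "markdown":
            continue  # skip code cells and anything else
        source = cell["source"]
        # "source" may be a list of lines or one string; normalise to a string.
        texts.append(source if isinstance(source, str) else "".join(source))
    return texts[index]


# Hypothetical notebook with one code cell and one markdown cell.
demo_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["import pandas as pd\n"]},
        {"cell_type": "markdown", "metadata": {},
         "source": [FENCE + " mysql\n", "SELECT *\n",
                    "FROM table_a AS a\n", "LIMIT 10;\n", FENCE]},
    ],
}

# Write the demo notebook to a temporary file and read the cell back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.ipynb")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(demo_notebook, f)
    cell_text = markdown_cell_text(path, 0)

print(cell_text)
```

As with markread, this only sees what has been saved to disk.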
The javascript method, however, is more complex. It invokes JS because JS has access to the cells, including the markdown ones. JS reads the cell values (from Jupyter.notebook.get_cells), filters the markdown ones, then calls back into Python and sends the selected markdown cell, URL-encoded. The encoded text is decoded and assigned to the callbackvar. In both methods I made some assumptions, which may not be correct, about trimming the start and the end of the cell value (the ``` and whitespace).
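The decode-and-trim step that runs on the Python side can be checked without a notebook. In the sketch below, urllib's quote stands in for the browser's encodeURI (an assumption for illustration; the two do not escape exactly the same set of characters), and strip_fence is a hypothetical helper that applies the same slicing the script sends to the kernel.

```python
from urllib.parse import quote, unquote

FENCE = "`" * 3  # three backticks


def strip_fence(text):
    """Drop the opening fence line and the closing fence, then trim whitespace."""
    # Start right after the first newline that follows the opening fence.
    start = text.find("\n", text.find(FENCE)) + 1
    # End at the last newline or the closing fence, whichever comes first.
    end = min(text.rfind("\n"), text.rfind(FENCE))
    return text[start:end].strip()


# Markdown cell text as the frontend would see it.
cell = FENCE + " mysql\nSELECT *\nFROM table_a AS a\nLIMIT 10;\n" + FENCE

encoded = quote(cell)       # stand-in for encodeURI on the JavaScript side
decoded = unquote(encoded)  # what the kernel does with the received text
query = strip_fence(decoded)

print(query)
```

The resulting string is what would then be handed to pd.read_sql(query, conn) in place of the duplicated literal.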
There are ways to improve the code, for example making it auto-detect the notebook name for the file read method, but that involves even more hacks: relying on JavaScript again to get the notebook name, or making a call to the API on port 8888 and having to deal with the session password. I believe the most important part is already covered by our script. If one method does not work, you will probably still have the other.
I'm very new to this.
I have a query and an xml file.
I can write a query over that specific file
for $x in doc("file:///C:/Users/Foo/IdeaProjects/XQuery/src/books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title
I have a basic xml file, with books in it, works nicely in intellij.
but if I wanted to run this query against some file defined on the command line, then how do I do it?
The command line for running the above (as much for other people's reference) is
java -cp C:\Users\Foo\.IdeaIC2019.2\config\plugins\xquery-intellij-plugin\lib\Saxon-HE-9.9.1-7.jar net.sf.saxon.Query -t -q:"C:\Users\Foo\IdeaProjects\XQuery\src\w3schools.com.xqy"
and that also works nicely.
the saxon documentation
https://www.saxonica.com/html/documentation/using-xquery/commandline.html
implies that I can specify an input file, using "-d"
and "The document node of the document is made available to the query as the context item"
but this doesn't really make any sense to my 1-day-old XQuery skills.
How do I tell the query that the document is supplied on the command line? What is the context item, and how do I reference it?
(I can do a bit of XSLT 1.0, so I understand the notion of a context).
I think the option is named -s (for source), so you can use -s:books.xml. Inside your XQuery main expression, any path is then evaluated with that document as the context item, so you can just use e.g.
for $x in /bookstore/book
where $x/price>30
order by $x/title
return $x/title
and the answer is to drop the doc() function
for $x in bookstore/book
i.e. the same notion as xslt.
I have only just started with MarkLogic and XQuery. I am having a really tough time in modifying the content of one of my XML documents. I just cannot seem to get a change to an element to pick up. Here's my process (I have had to take things back as basic as I could just to try and get it working):
In query console I have one tab open which queries for the contents of one XML doc:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
xdmp:document-get("C:/Users/Paul/Documents/MarkLogic/xml/ppl/ppl/jdbc_ppl_3790.xml")
This brings back the document as below
false
...
3790
Victoria Wilson
</ppl_name>
I now want to update the element using XQuery but it's just not happening. Here's the XQuery:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
let $docxml :=
  xdmp:document-get("C:/Users/Paul/Documents/MarkLogic/xml/ppl/ppl/jdbc_ppl_3065.xml")/document/meta/ppl_name
return
  for $node in $docxml/*
  let $target := xdmp:document-get("C:/Users/Paul/Documents/MarkLogic/xml/ppl/ppl/jdbc_ppl_3790.xml")/document/meta/*[fn:name() = fn:name($node)]
  return
    xdmp:node-replace($target, $node)
I am basically looking to replace the ppl_name element in the target (3790) with the ppl_name element from the source (3065).
I run the XQuery. It completes without error (making me think it has worked), and the return value reads "your query returned an empty sequence".
I then go back to the same tab as I used in step 1 and re-run the XQuery used in step 1. The doc (3790) comes back but it STILL has Victoria Wilson as the ppl_name.
The node returned by xdmp:document-get is an in-memory node from a document on the filesystem. It isn't coming from the database. You can't use xdmp:node-replace on in-memory nodes. That's only for database-resident nodes.
You can insert it using xdmp:document-insert. Then it's in the database, and you can access it using doc and update it using xdmp:node-replace. Or you can use in-memory operations to construct a new version with the changes you want.
See What are in memory elements in marklogic? for previous answers to a similar question, and more tips.
Here the node returned by xdmp:document-get is an in-memory node.
If you're working with in-memory elements, import the following module:
import module namespace mem = "http://xqdev.com/in-mem-update" at "/MarkLogic/appservices/utils/in-mem-update.xqy";
Instead of using xdmp:node-replace you can use mem:node-replace(<x/>, <y/>)
I want one script to command several computers to break up a highly distributable workload. In order to distribute the workload I put half of the task labels in one file and the other half in another file, and I distribute the files to the computers with Google Drive (which is why I need different file names). So C:\googledrive\task1.txt and C:\googledrive\task2.txt
The autohotkey command looks like:
loop, read, c:\googledrive\task*.txt
But instead of reading task1.txt, it appears to try to read "task*.txt" as a literal file name, fails, and ends the loop.
Ideas? Thanks.
OK, I tried ensuring everything was running with administrator rights (it is), that the files exist (they do), and that there are no typos in the file path (everything good there). It still won't actually read the file.
There is one bit that I didn't include in the original post: part of the file name is actually a variable, so the loop command actually looks like:
loop, read, c:\googledrive\%task%*.txt
I just figured that bit was inconsequential.
If I save a different script for each computer, I can go ahead and replace the wildcard with the actual bit, and it works.
So... I'm just going to name each file with the computer's name in the file name, and change the command to:
loop, read, c:\googledrive\%task%%A_ComputerName%.txt
I do it this way....
Loop, C:\Temp\Source\*.txt ; Lists the next file as A_LoopFileName
{
    Loop, read, C:\Temp\Source\%A_LoopFileName% ; process current file
    {
        IfInString, A_LoopReadLine, abc
        {
            .......
        }
    }
}
I am using MarkLogic 4 and I have some 15000 documents (each around 10 KB). I want to load the entire content as one document (and convert all the documents to a single CSV file, written to the HTTP output stream for downloading). I load the documents this way:
let $uri := cts:uri-match('products/documents/*.xml')
let $doc := fn:doc ($uri)
The pattern matches some 15000 XML documents, so fn:doc throws an XDMP-EXPNTREECACHEFULL error.
Is there any workaround for this? I cannot increase tree cache size in admin console because the number of xml files in products/documents/*.xml may increase.
Thanks.
When you want to export large quantities of XML from MarkLogic, the best technique is to write the query so that results can stream, avoiding the expanded tree cache entirely. It is a very different style of coding, though: you'll have to avoid strong typing of any kind, and refactor your code to remove FLWOR expressions. You won't be able to test any of the code in cq or qconsole, either.
Take a look at http://blakeley.com/blogofile/2012/03/19/let-free-style-and-streaming/ for some tips on how to get there. At a minimum the code sample you posted would have to become:
doc(cts:uri-match('products/documents/*.xml'))
In passing I would try to rework that to avoid the *.xml part, because it will be slower than needed. Maybe something like this?
cts:search(
collection(),
cts:directory-query('products/documents/', 'infinity'))
If you need to test for something more than the directory, you could add a cts:and-query with some cts:element-query test.
For general information about this error, see the MarkLogic knowledge base article on XDMP-EXPNTREECACHEFULL