BaseX: Slow XQuery

I've got a BaseX XML database with ~20 XML files. The files differ in size and structure. The biggest one is 524 MB; it consists of a parent ARTICLE element with 267685 ART child elements.
This is my XQuery: "/ARTICLE/ART[PRDNO=12345]" (pretty straightforward; proper namespaces omitted for clarity). PRDNO is a foreign key into the PRODUCT/PRD XML structure; there are multiple products per article (~10 on average).
Everything works as it is supposed to, but this query is quite slow - it takes approximately 1s to execute. Similar queries for other objects in the database (where the data amount is smaller) are much faster.
What can I do to optimize this query?
I ran "optimize" (which took some minutes), I ensured the TEXT index is in place.
This is the output of "info database":
> info database
Database Properties
Name: hospindex
Size: 1740 MB
Nodes: 69360063
Documents: 22
Binaries: 0
Timestamp: 2014-09-03-09-34-07
Resource Properties
Timestamp: 2014-09-03-09-21-14
Encoding: UTF-8
CHOP: true
Indexes
Up-to-date: true
TEXTINDEX: true
ATTRINDEX: true
FTINDEX: false
LANGUAGE: English
STEMMING: false
CASESENS: false
DIACRITICS: false
STOPWORDS:
UPDINDEX: false
MAXCATS: 100
MAXLEN: 96
EDIT: This is the query execution plan:
Compiling:
- adding text() step
Query:
/*:ARTICLE/*:ART[*:PRDNO=1005935]
Optimized Query:
(db:open-pre("hospindex",0), db:open-pre("hospindex",32884731), ...)/*:ARTICLE/*:ART[(*:PRDNO/text() = 1005935)]
Result:
- Hit(s): 1 Item
- Updated: 0 Items
- Printed: 2078 Bytes
- Read Locking: local [hospindex]
- Write Locking: none
Timing:
- Parsing: 1.12 ms
- Compiling: 0.46 ms
- Evaluating: 1684.35 ms
- Printing: 0.35 ms
- Total Time: 1686.3 ms
Query plan:
<QueryPlan>
<IterPath>
<DBNodeSeq size="22">
<DBNode name="hospindex" pre="0"/>
<DBNode name="hospindex" pre="32884731"/>
<DBNode name="hospindex" pre="33685448"/>
<DBNode name="hospindex" pre="38260847"/>
<DBNode name="hospindex" pre="38358876"/>
</DBNodeSeq>
<IterStep axis="child" test="*:ARTICLE"/>
<IterStep axis="child" test="*:ART">
<CmpG op="=">
<CachedPath>
<IterStep axis="child" test="*:PRDNO"/>
<IterStep axis="child" test="text()"/>
</CachedPath>
<Int value="1005935" type="xs:integer"/>
</CmpG>
</IterStep>
</IterPath>
</QueryPlan>

Your query will be evaluated much faster when using quotes around your search value:
/ARTICLE/ART[PRDNO = "12345"]
The reason is that the current version of BaseX does not provide a numeric value index (it may be included in BaseX 8.0).
You get more insight into the query compilation steps by turning on the QUERYINFO option.
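As a minimal sketch (namespaces omitted as in the question, database name taken from the info output above), the same lookup can also be expressed with explicit text-index access via the Database Module, which likewise works on string values only:

(: explicit text-index lookup, then navigate back up to the ART elements :)
db:text("hospindex", "12345")/parent::*:PRDNO/parent::*:ART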

Related

Compression without dictionary

I have been testing the various compression algorithms with parquet files, and have settled on Zstd.
As far as I understand, Zstd uses an adaptive dictionary unless one is explicitly specified, so it starts from an empty one. However, with the dictionary enabled, both the compressed size and the execution time are quite unsatisfactory.
The file size without a dictionary is considerably smaller than with the adaptive one (the number at the end of each name is the compression level):
Name: C:\ParquetFiles\Zstd1 Execution time: 279 ms Size: 13738134
Name: C:\ParquetFiles\Zstd2 Execution time: 140 ms Size: 13207017
Name: C:\ParquetFiles\Zstd9 Execution time: 511 ms Size: 12701030
And for comparison the log from using the adaptive dictionary:
Name: C:\ParquetFiles\ZstdDictZstd1 Execution time: 487 ms Size: 19462825
Name: C:\ParquetFiles\ZstdDictZstd2 Execution time: 402 ms Size: 19292513
Name: C:\ParquetFiles\ZstdDictZstd9 Execution time: 614 ms Size: 19072779
Can you help me understand the significance of this? Shouldn't compression starting from an empty dictionary perform at least as well as Zstd compression with the dictionary disabled?

Optimizing `data.table::fread` speed advice

I am experiencing read speeds that I believe are much slower than should be expected when trying to read a fairly large file in R with fread.
The file is ~60m rows x 147 columns, of which I am selecting only 27 columns directly in the fread call using select; only 23 of the 27 are found in the actual file (I probably typed some of the names incorrectly, but I guess that matters less).
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
verbose = TRUE,
select = cols2Select)
The system being used is an Azure VM with a 16-core Intel Xeon and 114 GB of RAM, running Windows 10.
I'm also using R 3.5.2, RStudio 1.2.1335 and data.table 1.12.0
I should also add that the file is a csv file that I have transferred onto the local drive of the VM, so there is no network / ethernet involved. I am not sure how Azure VMs work and what drives they use, but I would assume it's something equivalent to an SSD. Nothing else is running / being processed on the VM at the same time.
Please find below the verbose output of fread:
omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 16 threads (omp_get_max_threads()=16, nth=16)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file ..\TOI\TOI_RAW_APextracted.csv
File opened, size = 49.00GB (52608776250 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 147 fields using quote rule 0
Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>>
Quote rule picked = 0
fill=false and the most number of columns found is 147
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216
Type codes (jump 000) : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555 Quote rule 0
Type codes (jump 001) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0
Type codes (jump 002) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0
Type codes (jump 003) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0
Type codes (jump 010) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0
Type codes (jump 031) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0
Type codes (jump 098) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0
Type codes (jump 100) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0
'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows
=====
Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 2 to the end of last row: 52608774311
Line length: mean=956.51 sd=35.58 min=823 max=1063
Estimated number of rows: 52608774311 / 956.51 = 55000757
Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000
[10] Allocate memory for the datatable
Allocating 23 column slots (147 - 124 dropped) with 60500832 rows
[11] Read the data
jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time
[12] Finalizing the datatable
Type counts:
124 : drop '0'
3 : int32 '5'
7 : float64 '7'
13 : string 'A'
=============================
0.000s ( 0%) Memory map 48.996GB file
0.035s ( 0%) sep=',' ncol=147 and header detection
0.001s ( 0%) Column type detection using 10045 sample rows
6.000s ( 0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads
+ 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times)
+ 22.774s ( 1%) Transpose
+ 144.273s ( 8%) Waiting
24.545s ( 1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s Total
Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14
Basically, I would like to find out if this is just normal or if there is anything I can do to improve these reading speeds. Based on various benchmarks I've seen around and my own experience and intuition with fread using smaller files, I would have expected this to be read in much much quicker.
Also I was wondering if the multi-core capabilities are fully being used, as I have heard that under Windows this might not always be straightforward. My knowledge around this topic is pretty limited unfortunately, but it does appear from the verbose output that fread is detecting 16 cores.
Thoughts:
(1) If you are using Windows, use Microsoft R Open; even more so if the cloud is Azure, since there may be coordination between R Open and the Azure client. Because of Intel's MKL and Microsoft's built-in enhancements, I find Microsoft R Open faster on Windows.
(2) I suspect 'select' and 'drop' are applied after a full file read. Maybe read the whole file and subset or filter afterwards (see the sketch after this list).
(3) I think a restart is overkill. I run gc() three times every so often, like this: gc();gc();gc(). I have heard others say this does nothing, but at least it makes me feel better; actually, I notice it helps me on Windows.
(4) The latest versions of data.table's fread are adding YAML support. This looks promising.
(5) setDTthreads(0) uses all the cores. Too much parallelization can work against you; try halving your cores (also shown in the sketch below).
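A minimal R sketch of items (2) and (5), under the question's own setup (cols2Select and the file path come from the question; halving the thread count is just a heuristic to try, not a measured recommendation):

library(data.table)

# Item (5): 0 = use all available threads; try half of them if the parallel
# reads seem to contend for I/O on the VM's disk.
setDTthreads(0)
# setDTthreads(parallel::detectCores() %/% 2)   # alternative: half the cores

dt <- fread("..\\TOI\\TOI_RAW_APextracted.csv",
            select = cols2Select,   # cols2Select as defined in the question
            verbose = TRUE)

# Item (2), if select/drop turn out not to help: read everything, then subset.
# dt_all <- fread("..\\TOI\\TOI_RAW_APextracted.csv", verbose = TRUE)
# dt     <- dt_all[, ..cols2Select]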

Reading MarkLogic logs from Query Console using XQuery

I want to read MarkLogic logs (e.g. ErrorLog.txt) from Query Console using XQuery. I am using the code below, but the output is not properly formatted; a sample of the result is shown after the code.
xquery version "1.0-ml";
for $hid in xdmp:hosts()
let $h := xdmp:host-name($hid)
return
xdmp:filesystem-file("file://" || $h || "/" ||xdmp:data-directory($hid) ||"/Logs/ErrorLog.txt")
The problem is that the result comes back host by host: first all the log entries of one host, then host 2 starting again at 00:00:01, then host 3 starting at 00:00:01, and so on.
2019-07-02 00:00:35.668 Info: Merging 2 MB from /cams/q06data02/testQA2/Forests/testQA2-2.2/0002b4cd to /cams/q06data02/testQA2/Forests/testQA2-2.2/0002b4ce, timestamp=15620394303480170
2019-07-02 00:00:36.007 Info: Merged 3 MB at 9 MB/sec to /cams/q06data02/testQA2/Forests/test2-2.2/0002b4ce
2019-07-02 00:00:38.161 Info: Deleted 3 MB at 399 MB/sec /cams/q06data02/test2/Forests/test2-2.2/0002b4cd
Is it possible to include the host name with each log entry and to sort the combined output by timestamp, something like:
host 1 : 2019-07-02 00:00:01 : Info Merging ....
host 2 : 2019-07-02 00:00:02 : Info Deleted 3 MB at 399 MB/sec ...
Log files are text files. You can parse and sort them like any other text file.
They can get very large (GB+), though, so simple methods may not be performant.
Plus you need to be able to parse the text into fields in order to sort by a field.
Since the first 20 bytes of every line are the timestamp, and that timestamp is in ISO format (which sorts lexically the same as chronologically), you can split the files into lines and sort them with basic collation to get time-ordered output across multiple files, as sketched below.
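A minimal XQuery sketch of that approach, building directly on the code from the question (continuation lines of multi-line messages are not handled):

xquery version "1.0-ml";
(: Fetch each host's ErrorLog.txt, prefix every line with the host name,
   and sort all lines by the leading ISO timestamp (first 20 characters). :)
for $hid in xdmp:hosts()
let $h := xdmp:host-name($hid)
let $log := xdmp:filesystem-file(
              "file://" || $h || "/" || xdmp:data-directory($hid) || "/Logs/ErrorLog.txt")
for $line in fn:tokenize($log, "\n")[fn:string-length(.) ge 20]
order by fn:substring($line, 1, 20)
return $h || " : " || $line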
In V9 one can use the pair of xdmp:logfile-scan and xdmp:logmessage-parse to efficiently search over log files (remote as well as local) and then transform the results into text, XML (attribute or element format) or JSON.
One can also use the REST API for the same.
see: https://docs.marklogic.com/REST/GET/manage/v2/logs
Once the log files (ideally a selected subset of log messages that is small enough to manage) are converted to a structured format (XML, JSON or text lines), sorting, searching, enriching etc. are easily performed.
For something much better, take a look at Ops Director: https://docs.marklogic.com/guide/opsdir/intro

BaseX GUI: Writeback not setting to true

I am using BaseX 7.9 and want to set the WRITEBACK option to true, so I execute db:writeback[true] in the Editor window.
The Query Info shows:
Compiling:
- removing unknown element/attribute true
- db:writeback[()]: removing ()
Query:
db:writeback[true]
Optimized Query:
()
Result:
- Hit(s): 0 Items
- Updated: 0 Items
- Printed: 0 Bytes
- Read Locking: local [prueba_08242014_01]
- Write Locking: none
Timing:
- Parsing: 0.93 ms
- Compiling: 0.27 ms
- Evaluating: 0.42 ms
- Printing: 1.24 ms
- Total Time: 2.86 ms
Query plan:
<QueryPlan>
<Empty size="0"/>
</QueryPlan>
Yet when I then execute db:system(), WRITEBACK appears as false in the result window:
<system>
<localoptions>
...
<writeback>false</writeback>
...
</localoptions>
</system>
(It is abbreviated)
What Went Wrong
BaseX automatically registers the db prefix for the http://basex.org/modules/db namespace. Your code is evaluated as XQuery: it returns all root elements in the db namespace with the local name writeback, then filters them with a predicate that keeps only those with a child element named true. An input document that would match this query is
<writeback xmlns="http://basex.org/modules/db"><true/></writeback>
Modifying Options
To modify options in BaseX, use the SET [option] [value] command in the Command input.
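For example, a minimal command sequence (the XQUERY command simply runs a query from the Command input, so db:system() as used in the question can confirm the change):

SET WRITEBACK true
XQUERY db:system()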

Perl encounters "out of memory" in openvms system

I am using a 32-bit Perl on my OpenVMS system (so Perl can access up to 2 GB of virtual address space).
I am hitting "Out of memory!" in a large Perl script. I zeroed in on the location of the variable causing this. However, after my tests with Devel::Size it turns out the array is using only 13 MB of memory, and the hash is using much less than that.
My question is about memory profiling this Perl script on VMS.
Is there a good way of doing a memory profile on VMS?
I used Devel::Size to get the size of the array and the hash (the array has local scope and the hash has global scope).
DV Z01 A4:[INTRO_DIR]$ perl scanner_SCANDIR.PL
Directory is Z3:[new_dir]
13399796 is total on array
3475702 is total on hash
Directory is Z3:[new_dir.subdir]
2506647 is total on array
4055817 is total on hash
Directory is Z3:[new_dir.subdir.OBJECT]
5704387 is total on array
6040449 is total on hash
Directory is Z3:[new_dir.subdir.XFET]
1585226 is total on array
6390119 is total on hash
Directory is Z3:[new_dir.subdir.1]
3527966 is total on array
7426150 is total on hash
Directory is Z3:[new_dir.subdir.2]
1698678 is total on array
7777489 is total on hash
(edited: mis-spelled PGFLQUOTA)
Where is that output coming from? To OpenVMS folks it suggests files in directories, which the code might be slurping in. There would typically be considerable malloc/alignment overhead per element saved.
Anyway, the available ADDRESSABLE memory when strictly using 32-bit pointers on OpenVMS is 1 GB, not 2 GB: 0x0 .. 0x3fffffff for programs and (malloc) data in 'P0' space. There is also room in P1 space (0x40000000 .. 0x7fffffff) for thread-local stack storage, but Perl does not use (much of) that.
From a second session you can look at that with DCL:
$ pid = "xxxxxxxx"
$ write sys$output f$getjpi(pid,"FREP0VA"), " ", f$getjpi(pid,"FREP1VA")
$ write sys$output f$getjpi(pid,"PPGCNT"), " ", f$getjpi(pid,"GPGCNT")
$ write sys$output f$getjpi(pid,"PGFLQUOTA")
However... those are just address ranges, NOT how much memory the process is allowed to use. That is governed by the process page-file quota. Check with $ SHOW PROC/QUOTA before running Perl. Actual usage can be reported from the outside, as per the DCL above, by adding the private pages and the global (shared) pages.
Another nice way to look at memory (and other quotas) is SHOW PROC/CONT ... and then hit "q".
So how many elements are stored in each large active array? How large is an average element, rounded up to 16 bytes? How many hash elements? How large are the key + value on average (round up generously)?
What is the exact message?
Does the program 'blow up' right away, or after a while (so that you can use SHOW PROC/CONT)?
Is there a source file data set (size) that does work?
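To gather those numbers, here is a minimal Perl sketch using Devel::Size (the variable names are placeholders, not taken from your script):

use strict;
use warnings;
use Devel::Size qw(total_size);

# Placeholders standing in for the script's own data structures:
my @big_array = (1 .. 10_000);
my %big_hash  = map { $_ => "value $_" } 1 .. 10_000;

# Report element counts alongside total_size(), since per-element
# malloc/alignment overhead can dwarf the payload itself.
printf "array: %d elements, %d bytes total\n",
    scalar(@big_array), total_size(\@big_array);
printf "hash:  %d keys, %d bytes total\n",
    scalar(keys %big_hash), total_size(\%big_hash);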
Cheers,
Hein.
