I'm here to ask whether I'm configuring eXist-db wrong or whether it's simply unable to cope with the amount of data I need to store and query.
I'm running eXist 4.3.1 stable on Ubuntu 18 on a quad-core i5 machine with 16 GB RAM, of which I've allocated 8 GB to eXist. I configured new range indexes on all values I'm interested in querying. The indexes work: I can test them with simple queries, and in Monex they show up as fully optimized using the new range index.
Right now I'm testing with 110,434 XML files ranging from 20 KB to 3 MB. I'm using XML namespaces and optimized queries (I read https://exist-db.org/exist/apps/doc/tuning), but I still observe insanely long execution times.
This query:
xquery version "3.1";
declare namespace oai = "http://www.openarchives.org/OAI/2.0/";
for $x in collection("/db/apps/ddb/data")
return $x//oai:identifier
takes 0.5 seconds to execute (great!). If I add a contains() predicate (which uses a new range index), like so:
xquery version "3.1";
declare namespace oai = "http://www.openarchives.org/OAI/2.0/";
for $x in collection("/db/apps/ddb/data")
return $x//oai:identifier[contains(., 'mainz')]
the execution time is longer than 5 minutes, which is by no means acceptable.
I attached an image which shows the long execution times and the index usage:
It would be great if someone who works with large datasets in eXist could comment on its performance, or on my index configuration and/or query writing.
Thanks!
To expand on Amrendra Kumar's tip, and to quote the docs you already mentioned:
Consider an n-gram index for exact substring queries on longer text sequences
XQuery string operations, e.g. those that process regular expressions, require processing the full text of the node, regardless of whether you have created a new range index. You can either substitute these expressions with a more performant one (matches() instead of contains()), or use another index type (n-gram or full-text) instead of the new range index.
When you have defined multiple indexes on the same node, you can specify which one should be used via the appropriate XQuery expression, such as ft:query(). This can greatly improve performance, since the default query optimizer is bound to get it wrong every now and then.
Without some sample data and the collection.xconf it is, however, impossible to comment on your index configuration. Your needs don't sound particularly outlandish for eXist-db, but without knowing more about concurrent users, update frequency, etc., I can't offer more than general remarks like these.
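For illustration, here is a minimal sketch of what an n-gram setup could look like. The collection path and element name are taken from the question; treat the exact configuration file location and options as assumptions to verify against your installation:

<!-- /db/system/config/db/apps/ddb/data/collection.xconf (sketch) -->
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:oai="http://www.openarchives.org/OAI/2.0/">
        <!-- n-gram index for substring searches on oai:identifier -->
        <ngram qname="oai:identifier"/>
    </index>
</collection>

After reindexing the collection, the substring test can be rewritten against the n-gram module, which, unlike a plain contains(), is backed by that index:

xquery version "3.1";
declare namespace oai = "http://www.openarchives.org/OAI/2.0/";
declare namespace ngram = "http://exist-db.org/xquery/ngram";

(: ngram:contains() is optimized by the n-gram index declared above :)
collection("/db/apps/ddb/data")//oai:identifier[ngram:contains(., 'mainz')]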
Related
I am a bit confused if this is possible in DynamoDB.
I will give an SQL example and explain how the query could be optimized, and then I will try to explain why I am confused about how to model and access the same data in DynamoDB.
This is not company code. Just an example I made up based on pcpartpicker filter.
SELECT * FROM BUILDS
WHERE CPU='Intel' AND OVERCLOCKED='true'
AND Price < 3000
AND GPU='GeForce RTX 3060'
AND ...
From my understanding, SQL will first do a scan on the BUILDS table and keep only the builds whose CPU is Intel. From this subset, it then applies the next condition to filter on OVERCLOCKED = 'true', and so on and so forth. Basically, each additional condition has a smaller number of rows to filter.
One thing we can do to speed up this query is to create an index on these columns. The main performance gain is avoiding the initial scan of the whole table for the first clause the database looks at. So, in the example above, instead of scanning the whole table to find builds that use Intel, it can retrieve them quickly because the column is indexed.
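For concreteness, such an index could be created like this (which column to index is just illustrative):

CREATE INDEX IDX_BUILDS_CPU ON BUILDS (CPU);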
How would you model this data in DynamoDB? I know you can create a bunch of secondary indexes, but instead of letting the engine apply one WHERE clause and pass the result along to the next filter, it seems like you would have to do all of this yourself. For example, we would need to use our secondary indexes to find all the builds that use Intel, are overclocked, cost less than 3000, and use a specific GPU, and then we would need to find the intersection ourselves. Is there a better way to map out this access pattern? I am having a hard time figuring out if this is even possible.
EDIT:
I know I could also just use a normal filter expression, but that seems pretty expensive, since it basically brute-force searches through the table, similar to the SQL solution without indexing.
To see what I mean, here is the link to the relevant pcpartpicker page: https://pcpartpicker.com/builds/
People basically select multiple filters, which makes designing for access patterns even harder.
I'd highly recommend going through the various AWS presentations on YouTube...
In particular here's a link to The Iron Triangle of Purpose - PIE Theorem chapter of the AWS re:Invent 2018: Building with AWS Databases: Match Your Workload to the Right Database (DAT301) presentation.
DynamoDB provides IE - Infinite Scale and Efficiency.
But you need P - Pattern Flexibility.
You'll need to decide if you need PI or PE.
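If the workload really does need all those predicates in DynamoDB, one common compromise (my suggestion, not something the talk prescribes) is to query a GSI keyed on the most selective attribute and apply the remaining predicates as a filter expression, keeping in mind that filtered-out items still consume read capacity. A minimal boto3 sketch, where the table, index, and attribute names are all assumptions:

import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Builds")  # hypothetical table name

# Query a hypothetical GSI keyed on the GPU, then filter the rest server-side.
response = table.query(
    IndexName="gpu-index",
    KeyConditionExpression=Key("gpu").eq("GeForce RTX 3060"),
    FilterExpression=Attr("cpu").eq("Intel")
    & Attr("overclocked").eq(True)
    & Attr("price").lt(3000),
)
builds = response["Items"]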
I have implemented wildcard search using the Oracle Coherence API. The search runs on four string fields using
1) "LikeFilter" with "fIgnoreCase" set to true,
2) search text given as % patterns (e.g. "%test%"), and
3) the individual filters accumulated with "AnyFilter".
When the volume of data in the cache is huge, the searches become very slow.
Applying the standard index does not have any effect on the performance, as it appears that this index works only for exact matches or comparisons.
Is there any special type of index in Coherence for wildcard searches (similar to the new indexes in Oracle TEXT)? If not, is there any other way to improve wildcard query performance on Coherence, with large data sets in the cache?
Please provide a code snippet so we can understand the current solution. Also, I hope the following practices are already applied:
Using an explain plan to see the query performance (a sketch follows below)
Leveraging data-grid-wide execution for parallel processing, considering the volume of data
Also, we need information on the volume of data (in GB) along with the Coherence setup in place (number of nodes, size of each node) to understand the sizing of the cluster.
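As a sketch of the explain-plan step (the cache name and extractor method names are assumptions, not taken from the question):

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.Filter;
import com.tangosol.util.aggregator.QueryRecorder;
import com.tangosol.util.filter.AnyFilter;
import com.tangosol.util.filter.LikeFilter;

public class ExplainWildcardQuery {
    public static void main(String[] args) {
        // Four case-insensitive wildcard filters OR-ed together, as described above;
        // getFieldA..getFieldD are hypothetical extractor method names.
        Filter filter = new AnyFilter(new Filter[] {
            new LikeFilter("getFieldA", "%test%", '\\', true),
            new LikeFilter("getFieldB", "%test%", '\\', true),
            new LikeFilter("getFieldC", "%test%", '\\', true),
            new LikeFilter("getFieldD", "%test%", '\\', true)
        });

        NamedCache cache = CacheFactory.getCache("documents");  // hypothetical cache name

        // Ask the grid to explain how it would evaluate the filter; the resulting
        // record shows whether any index is consulted or every entry is scanned.
        Object plan = cache.aggregate(filter,
                new QueryRecorder(QueryRecorder.RecordType.EXPLAIN));
        System.out.println(plan);
    }
}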
What would be the right way to search riak-search for documents that need correction, and then update them?
By design, riak-search is an index that may NOT exactly track the riak-kv content. I expect that during heavy check/write operations my index won't match my riak-kv content.
I count on riak-search to limit read/write operations to a limited number of matching entries.
I really can't operate using this kind of algorithm:
page = 0
while True:
    # 'riak' is the connected RiakClient instance
    results = riak.fulltext_search('index', 'sex:male', start=page)
    if results['num_found'] == 0:
        break
    for r in results['docs']:
        obj = riak.bucket_type(r['_yz_rt']).bucket(r['_yz_rb']).get(r['_yz_rk'])
        # alter the object here
        obj.store()
    page = page + len(results['docs'])
I see a lot of issues with it:
First, as riak-search catches up with my writes, it will no longer find the documents I have already altered, which breaks my pagination.
Paginating from the end is a tempting alternative, but it will stress Solr, or hit the max_search_results limit.
Testing num_found is not a good way of breaking the loop, I'm pretty sure of it.
Should I load all riak-kv keys before starting to edit? Is there a proper algorithm/way to achieve my needs?
EDIT:
My use case is the following. I store text documents that contain an array of terms produced by my string tokenizer algorithm; like any machine learning system, it evolves and gets better over time. The string tokenizer does nothing but create a word cloud.
My bucket type is ever growing, and I need to patch old term arrays produced by previous tokenizer versions. To achieve that, I want to search for old documents, or for documents that contain bad tokens that I know are corrected in my new tokenizer version.
So, my search query is either:
terms:badtoken
created_date:[2000-11-01 TO 2014-12-01]
Working with the date is not an issue, but working with the token is: removing the bad token from a document changes the Solr index within seconds, so while I am still searching for "badtoken" it changes my current pagination and makes me miss documents.
For the moment, I have given up on using the index and simply walk the whole bucket.
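For reference, a minimal sketch of that bucket-walking fallback with the official Python client; the bucket type/name are assumptions and retokenize() stands in for the new tokenizer:

import riak

client = riak.RiakClient()
bucket = client.bucket_type('documents').bucket('texts')  # hypothetical names

# stream_keys() iterates over every key without touching Solr, so edits made
# along the way cannot invalidate a search-based pagination.
for keys in bucket.stream_keys():
    for key in keys:
        obj = bucket.get(key)
        terms = obj.data.get('terms', [])
        if 'badtoken' in terms:
            obj.data['terms'] = retokenize(obj.data)  # hypothetical re-tokenize step
            obj.store()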
I am facing a performance issue in one of my stored procedures.
Following is the pseudo-code:
CREATE OR REPLACE PROCEDURE SP_GET_EMPLOYEEDETAILS(P_EMP_ID IN NUMBER, CUR_OUT OUT SYS_REFCURSOR)
IS
BEGIN
  OPEN CUR_OUT FOR
    SELECT EMP_NAME, EMAIL, DOB FROM T_EMPLOYEES WHERE EMP_ID = P_EMP_ID;
END;
/
The above stored procedure takes around 20 seconds to return the result set with let's say P_EMP_ID = 100.
However, if I hard-code employee ID as 100 in the stored procedure, the stored procedure returns the result set in 40 milliseconds.
So the same stored procedure behaves differently for the same value, depending on whether it is hard-coded or passed in as a parameter.
The table T_EMPLOYEES has around 1 million records and there is an index on the EMP_ID column.
Would appreciate any help regarding this as to how I can improve the performance of this stored procedure or what could be the problem here.
This may be an issue with skewed data distribution and/or incomplete histograms and/or bad system tuning.
The fast version of the query is probably using an index. The slow version is probably doing a full-table-scan.
In order to know which to do, Oracle has to have an idea of the cardinality of the data (in your case, how many results will be returned). If it thinks a lot of results will be returned, it will go straight ahead and do a full-table-scan as it is not worth the overhead of using an index. If it thinks few results will be returned it will use an index to avoid scanning the whole table.
The issues are:
If using a literal value, Oracle knows exactly where to look in the histogram to see how many results would be returned. If using a bind variable, it is more complicated. Certainly, on Oracle 10 it didn't handle this well and just took a guess at the cardinality. On Oracle 11, I am not sure as it can do something called "bind variable peeking" - see SQL Plan Management.
Even if it does know the actual value, if your histogram is not up-to-date, it will get the wrong values.
Even if it works out an accurate guess as to how many results will be returned, you are still dependent on the Oracle system parameters being correct.
For this last point: basically, Oracle has some parameters that tell it how fast it thinks a full table scan is vs. how fast an index look-up is. If these are not correct, it may do an FTS even if it is a lot slower. See Burleson.
My experience is that Oracle tends to flip to doing FTS way too early. Ideally, as the result set grows in size there should be a smooth transition in performance at the point where it goes from using an index to using an FTS, but in practice the systems seem to be set up to favour bulk work.
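As a starting point for checking the histogram and cardinality points above, two hedged sketches (any name other than T_EMPLOYEES/EMP_ID is an assumption):

-- Show the plan and the estimated vs. actual row counts of the last execution
-- (needs STATISTICS_LEVEL=ALL or the /*+ GATHER_PLAN_STATISTICS */ hint).
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));

-- Refresh statistics, letting Oracle decide whether EMP_ID needs a histogram.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'T_EMPLOYEES',
    method_opt => 'FOR COLUMNS EMP_ID SIZE AUTO',
    cascade    => TRUE);
END;
/

If the plan flips between an INDEX RANGE SCAN and a TABLE ACCESS FULL depending on how the value is supplied, that points at the bind-variable/cardinality issue described above.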
I'm trying to get the 'xxx' element of all documents in MarkLogic using a query like:
(/doc/document)/xxx
But since we have a very big document database, I get the error "Expanded tree cache full on host". I don't have admin rights on this server, so I can't change the configuration. I assume I can use ranges while getting documents, like:
(/doc/document)[1 to 1000]/xxx
and then
(/doc/document)[1001 to 2000]/xxx
etc., but I'm concerned that I do not know how this works. For example, what happens if the database is changed during this process (e.g. a new document is added), and how does that affect the resulting document list? Also, I don't know which order is used when I work with ranges...
Please clarify: can this approach be appropriate, or is there another way to get some value from all documents?
Depending on how big your database is, there may be no way to get all the values in one transaction.
Suppose you have a trillion documents: the result set will be bigger than can be returned in one transaction.
Is that important? Only your business case can tell.
The most efficient way of getting all "xxx" values is with a range index. You can see how this works with cts:element-values (https://docs.marklogic.com/cts:element-values).
You do need to be able to create a range index over the element "xxx" to do this (ask your DBA).
Then cts:element-values() returns only those values, and the chances of being able to return most or all of them in memory in a single transaction are much higher than with the XPath (/doc/document/xxx), which, as you wrote, actually returns all the "xxx" elements (not just their values). That most likely requires loading every document matching /doc, parsing it, and returning the xxx element, which can be both slow and inefficient.
A range index just stores the values and you can retrieve those without ever having to load the actual document.
In general, when working with large datasets, learning how to access data in MarkLogic using only indexes will produce the fastest results.
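A minimal sketch, assuming an element range index has already been created over the element xxx (in no namespace):

xquery version "1.0-ml";

(: pulls the distinct xxx values straight from the range index,
   without loading any documents :)
cts:element-values(xs:QName("xxx"))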