Performance issue in Oracle Coherence cache using LikeFilter wildcard search

I have implemented a wildcard search using the Oracle Coherence API. I execute the search on four string fields using:
1) LikeFilter with fIgnoreCase set to true,
2) search text containing % wildcard patterns (e.g. "%test%"),
3) the individual filters combined with an AnyFilter.
When the volume of data in the cache is large, the searches become very slow.
Applying the standard index has no effect on performance, as it appears this index only helps exact matches and comparisons.
Is there a special type of index in Coherence for wildcard searches (similar to the newer indexes in Oracle Text)? If not, is there any other way to improve wildcard query performance in Coherence with large data sets in the cache?
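For context, a minimal sketch of the kind of filter composition described above (the cache name and getter names are hypothetical):

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.Filter;
import com.tangosol.util.filter.AnyFilter;
import com.tangosol.util.filter.LikeFilter;

public class WildcardSearch {
    public static void main(String[] args) {
        NamedCache cache = CacheFactory.getCache("items");   // hypothetical cache name

        // One case-insensitive LIKE filter per searchable field (getter names are made up).
        String pattern = "%test%";
        Filter filter = new AnyFilter(new Filter[] {
                new LikeFilter("getName",        pattern, '\\', true),
                new LikeFilter("getDescription", pattern, '\\', true),
                new LikeFilter("getCategory",    pattern, '\\', true),
                new LikeFilter("getTags",        pattern, '\\', true)
        });

        // With a leading '%' the filter cannot seek an ordered index,
        // so every candidate value still has to be examined.
        System.out.println(cache.entrySet(filter).size());
    }
}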

Please provide a code snippet so we can understand the current solution. Also, I hope the following practices are already applied:
* Using the explain plan to see how the query performs
* Leveraging data-grid-wide execution for parallel processing, considering the volume of data
We also need information on the volume of data (in GB) along with the Coherence setup in place (number of nodes, size of each node) to understand the sizing of the cluster.
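For reference, Coherence can report such an explain plan through the QueryRecorder aggregator; a minimal sketch (cache and getter names are hypothetical):

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.Filter;
import com.tangosol.util.aggregator.QueryRecorder;
import com.tangosol.util.filter.LikeFilter;

public class ExplainPlan {
    public static void main(String[] args) {
        NamedCache cache = CacheFactory.getCache("items");   // hypothetical cache name
        Filter filter = new LikeFilter("getName", "%test%", '\\', true);

        // EXPLAIN estimates the plan; TRACE actually runs the query data-grid wide
        // and records how effective each filter/index was on each node.
        Object explain = cache.aggregate(filter, new QueryRecorder(QueryRecorder.RecordType.EXPLAIN));
        Object trace   = cache.aggregate(filter, new QueryRecorder(QueryRecorder.RecordType.TRACE));

        System.out.println(explain);
        System.out.println(trace);
    }
}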

Related

How to model data in DynamoDB if your access pattern includes many WHERE conditions

I am a bit confused if this is possible in DynamoDB.
I will give an example in SQL, explain how the query could be optimized, and then try to explain why I am confused about how to model this and how to access the same data in DynamoDB.
This is not company code, just an example I made up based on the pcpartpicker filters.
SELECT * FROM BUILDS
WHERE CPU='Intel' AND OVERCLOCKED='true'
AND Price < 3000
AND GPU='GeForce RTX 3060'
AND ...
From my understanding, SQL will first do a scan on the BUILDS table and then filter out all the builds where the CPU is Intel. From this subset, it then applies another WHERE clause to filter OVERCLOCKED = true, and so on and so forth. Basically, each additional WHERE clause has a smaller number of rows to filter.
One thing we can do to speed up this query is to create an index on these columns. The main increase in performance comes from reducing the initial scan of the whole table for the first clause the database looks at. So in the example above, instead of scanning the whole database to find builds that use Intel, it can quickly retrieve them since the column is indexed.
How would you model this data in DynamoDB? I know you can create a bunch of secondary indexes, but instead of letting the engine apply each WHERE clause and pass the result along to the next set of filtering, it seems like you would have to do all of this yourself. For example, we would need to use our secondary index to find all the builds that use Intel, are overclocked, cost less than 3000, and use a specific GPU, and then we would need to find the intersection ourselves. Is there a better way to map out this access pattern? I am having a hard time figuring out if this is even possible.
EDIT:
I know I could also just use a normal filter, but it seems like this would be pretty expensive, since it basically brute-force searches through the table, similar to the SQL solution without indexing.
To see what I mean from pcpartpicker here is the link to the site with this page: https://pcpartpicker.com/builds/
People basically select multiple filters so it makes designing for access patterns even harder.
I'd highly recommend going through the various AWS presentations on YouTube...
In particular here's a link to The Iron Triangle of Purpose - PIE Theorem chapter of the AWS re:Invent 2018: Building with AWS Databases: Match Your Workload to the Right Database (DAT301) presentation.
DynamoDB provides IE - Infinite Scale and Efficiency.
But you need P - Pattern Flexibility.
You'll need to decide if you need PI or PE.
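If you do stay on DynamoDB, here is a hedged sketch of the access pattern from the question, querying a single hypothetical GSI on CPU and pushing the remaining conditions into a filter expression (table, index, and attribute names are all made up). Note that the filter is applied after the read, so read capacity is consumed for every item the key condition matches, which is exactly the efficiency trade-off described above.

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class BuildsQuery {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Hypothetical GSI "cpu-index" keyed on CPU; the remaining conditions go into a
        // filter expression, which runs after the items have already been read.
        QueryRequest request = QueryRequest.builder()
                .tableName("Builds")
                .indexName("cpu-index")
                .keyConditionExpression("CPU = :cpu")
                .filterExpression("Overclocked = :oc AND Price < :price AND GPU = :gpu")
                .expressionAttributeValues(Map.of(
                        ":cpu",   AttributeValue.builder().s("Intel").build(),
                        ":oc",    AttributeValue.builder().bool(true).build(),
                        ":price", AttributeValue.builder().n("3000").build(),
                        ":gpu",   AttributeValue.builder().s("GeForce RTX 3060").build()))
                .build();

        QueryResponse response = ddb.query(request);
        // scannedCount shows how many items were read (and paid for) before filtering.
        System.out.println("Matched " + response.count() + " of " + response.scannedCount() + " read");
    }
}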

eXistDB performance issues - how to improve?

I'm here to ask whether I'm configuring eXist-db wrong or whether it's simply unable to cope with the amount of data I need to store and query.
I'm running eXist 4.3.1 stable on Ubuntu 18 on a machine with a quad-core i5 and 16 GB RAM, of which I've allocated 8 GB to eXist. I configured new range indexes on all values I'm interested in querying. The indexes work: I can test them with simple queries, and in Monex they show up as fully optimized using the new range index.
Right now I'm testing with 110,434 XML files with sizes between 20 KB and 3 MB. I'm using XML namespaces and optimized queries (I read https://exist-db.org/exist/apps/doc/tuning), but I still observe insanely long execution times.
This query:
xquery version "3.1";
declare namespace oai = "http://www.openarchives.org/OAI/2.0/";
for $x in collection("/db/apps/ddb/data")
return $x//oai:identifier
takes 0.5 seconds to execute (great!). If I use a contains() predicate (which is using a new range index), like so:
xquery version "3.1";
declare namespace oai = "http://www.openarchives.org/OAI/2.0/";
for $x in collection("/db/apps/ddb/data")
return $x//oai:identifier[contains(., 'mainz')]
the execution time is longer than 5 minutes, which is by no means acceptable.
I attached an image which shows the long execution times and the index usage:
It would be great if someone who works with large datasets in eXist could comment on eXist's performance, or if someone could comment on my index configuration and/or query writing.
Thanks!
To expand on Amrendra Kumar's tip, and to quote the docs you already mentioned:
Consider an n-gram index for exact substring queries on longer text sequences.
XQuery string operations, e.g. those that process regular expressions, require full-text processing regardless of whether you have created a new range index. You can either substitute these expressions with a more performant expression, such as matches() instead of contains(), or use another index type (n-gram or full-text) instead of the new range index.
When you have defined multiple indexes on the same node, you can specify which index should be used via the appropriate XQuery expressions, such as ft:query. This can greatly improve performance, since the default query optimizer is bound to get it wrong every now and then.
Without some sample data and the collection.xconf it is, however, impossible to comment on your index configuration. Your needs don't sound particularly outlandish for eXist-db, but without knowing more about concurrent users, update frequency, etc., I can't offer more than general remarks like this.

High write concurrency backend for storing large set/array based data?

The problem:
I have a web service that needs to check membership of a given string against a set of strings, where the number of elements in the set is under constant growth, potentially numbering in the hundreds of millions.
If the string is not a member of the set, it gets added to the set. The string size will be a constant 32 bytes. Only one set variable is required, no other variables need to be persisted.
This check is performed as part of a callback on a webhook, thus performance is critical.
While my use case pretty much fits a bloom filter perfectly, I'm having trouble finding a solution to deal with the persistent storage vs i/o concurrency portion of the problem.
Environment:
DigitalOcean/Linux/Python/Flask, but open to change if required
Possible Solutions:
Redis, storing the values in a set and querying via SISMEMBER for a nice O(1) solution. This is what we are currently using, but this solution doesn't scale well with a large number of keys, given that everything must fit in memory, and it also has issues with write concurrency when traffic increases.
SQLite, with WAL mode turned on. I'm concerned about lock contention when the server gets hit with a significant number of webhook requests (SQLITE_BUSY). A local server file also doesn't scale across host machines.
PostgreSQL, which seems like a nice middle-ground solution, but I might have to deal with lock contention here as well for write concurrency.
Cassandra, given its focus on write performance. Overkill for storing a single column, though?
A custom Bloom filter backend; I'm not sure if something like this exists that provides the functionality of a Bloom filter with a storage backend offering high I/O concurrency.
Thoughts?
The Redis solution can scale well with data sharding. You can set up several Redis instances (or use Redis Cluster), split your data into several parts, i.e. shards, and save each part in a different Redis instance.
When you want to check the membership of a given string, you can send the SISMEMBER command to the corresponding Redis instance. Take this answer as an example of how to split data with hash functions.
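A minimal sketch of that sharding idea with the Jedis client (the "seen" key name and the hashCode-based shard selection are just illustrative):

import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class ShardedMembership {
    private final List<JedisPool> shards;   // one connection pool per Redis instance

    public ShardedMembership(List<JedisPool> shards) {
        this.shards = shards;
    }

    // Pick a shard by hashing the value, then ask only that instance.
    // Usage: new ShardedMembership(pools).contains("0123...")
    public boolean contains(String value) {
        int index = Math.floorMod(value.hashCode(), shards.size());
        try (Jedis jedis = shards.get(index).getResource()) {
            return jedis.sismember("seen", value);
        }
    }
}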
Also, you can implement a Bloom filter with Redis (GETBIT and SETBIT). Just a reminder: a Bloom filter has the false-positive problem.
First, you don't need to use SISMEMBER. Just do SADD systematically and test the returned value: if it's 0, the value was already in the set and so was not added. Doing so, you will very easily reduce the number of requests to Redis.
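For instance, a minimal Jedis sketch of that SADD-and-check pattern (the "seen" key name is illustrative):

import redis.clients.jedis.Jedis;

public class AddAndCheck {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String value = "0123456789abcdef0123456789abcdef";   // one of the 32-byte strings

            // SADD returns the number of elements actually added:
            // 1 -> the value was new, 0 -> it was already a member.
            // One round trip replaces the SISMEMBER + SADD pair.
            long added = jedis.sadd("seen", value);
            System.out.println(added == 0 ? "already in the set" : "newly added");
        }
    }
}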
Second, the description of your problem looks like a perfect match for HBase, which is made for storing very large data sets and querying them using Bloom filters. But you'll probably find it's overkill, just like Cassandra.

Caching result of SELECT statement for reuse in multiple queries

I have a reasonably complex query to extract the Id field of the results I am interested in based on parameters entered by the user.
After extracting the relevant Ids I am using the resulting set of Ids several times, in separate queries, to extract the actual output record sets I want (by joining to other tables, using aggregate functions, etc).
I would like to avoid running the initial query separately for every set of results I want to return. I imagine my situation is a common pattern so I am interested in what the best approach is.
The database is in MS SQL Server and I am using .NET 3.5.
It would definitely help if the question contained some measurements of the unoptimized solution (data sizes, timings). There is a variety of techniques that could be considered here, some listed in the other answers. I will assume that the reason why you do not want to run the same query repeatedly is performance.
If all the uses of the set of cached IDs consist of joins of the whole set to additional tables, the solution should definitely not involve caching the set of IDs outside of the database. Data should not travel there and back again if you can avoid it.
In some cases (when cursors or extremely complex SQL are not involved) it may be best (even if counterintuitive) to perform no caching and simply join the repetitive SQL to all desired queries. After all, each query needs to be traversed based on one of the joined tables and then the performance depends to a large degree on availability of indexes necessary to join and evaluate all the remaining information quickly.
The most intuitive approach to "caching" the set of IDs within the database is a temporary table (if named #something, it is private to the connection and therefore usable by parallel independent clients; or it can be named ##something and be global). If the table is going to have many records, indexes are necessary. For optimum performance, the index should be a clustered index (only one per table is allowed), or the index should be created only after the set has been constructed, which makes index creation slightly faster.
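For illustration, a minimal sketch of that temporary-table approach, shown here over JDBC for brevity (the same pattern applies from ADO.NET in .NET 3.5; all table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class TempTableExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; #CachedIds is private to this connection.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=MyDb;user=app;password=secret")) {

            try (Statement stmt = conn.createStatement()) {
                // Materialize the expensive ID query once, into a connection-private temp table.
                stmt.execute("SELECT o.Id INTO #CachedIds FROM dbo.Orders o WHERE o.Status = 'Open'");
                // Build the index only after the set has been populated, as suggested above.
                stmt.execute("CREATE CLUSTERED INDEX IX_CachedIds ON #CachedIds (Id)");
            }

            // Reuse the cached IDs in several follow-up queries on the same connection.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT l.OrderId, SUM(l.Amount) AS Total " +
                    "FROM dbo.OrderLines l JOIN #CachedIds c ON c.Id = l.OrderId " +
                    "GROUP BY l.OrderId");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + " -> " + rs.getBigDecimal(2));
                }
            }
        }
    }
}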
Indexed views are clearly preferable to temporary tables, except when the underlying data is read-only during the whole process, or when you can and want to ignore such updates to keep the whole set of reports consistent as far as the set goes. However, the ability of indexed views to always accurately project the underlying data comes at the cost of slowing down those updates.
One other answer to this question mentions stored procedures. This is largely a way of organizing your code. However, if you go this way, it is preferable to avoid using temporary tables, because such references to a temporary table prevent pre-compilation of the stored procedure; go for views or indexed views if you can.
Regardless of the approach you choose, do not guess at the performance characteristics and query optimizer behavior. Learn to display query execution plans (within SQL Server Management Studio) and make sure that you see index accesses as opposed to nested loops combining multiple large sets of data; only add indexes that demonstrably and drastically change the performance of your queries. A well chosen index can often change the performance of a query by a factor of 1000, so this is somewhat complex to learn but crucial for success.
And last but not least, make sure you use UPDATE STATISTICS when repopulating the database (and nightly in production), or your query optimizer will not be able to put the indexes you have created to their best uses.
If you are planning to cache the result set in your application code, then ASP.NET has a cache; a WinForms application can hold the data in an object and reuse it.
If you are planning to do the same in SQL Server, you might consider using indexed views to find the IDs. The view will be materialized, and hence you can get the results faster. You might even consider using a staging table to hold the IDs temporarily.
With SQL Server 2008 you can pass table variables as parameters to SQL. Just cache the IDs and then pass them as a table variable to the queries that fetch the data. The only caveat of this approach is that you have to predefine the table type as a user-defined table type (UDT).
http://msdn.microsoft.com/en-us/library/bb510489.aspx
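A rough sketch of that approach, shown here with the Microsoft JDBC driver for brevity (in ADO.NET you would pass a SqlParameter with SqlDbType.Structured); the dbo.IdList table type and dbo.GetOrderDetails procedure are hypothetical and must be created on the server first:

import java.sql.ResultSet;
import com.microsoft.sqlserver.jdbc.SQLServerCallableStatement;
import com.microsoft.sqlserver.jdbc.SQLServerConnection;
import com.microsoft.sqlserver.jdbc.SQLServerDataSource;
import com.microsoft.sqlserver.jdbc.SQLServerDataTable;

public class TvpExample {
    public static void main(String[] args) throws Exception {
        SQLServerDataSource ds = new SQLServerDataSource();
        ds.setURL("jdbc:sqlserver://localhost;databaseName=MyDb;user=app;password=secret");

        // Hypothetical type:      CREATE TYPE dbo.IdList AS TABLE (Id INT PRIMARY KEY);
        // Hypothetical procedure: CREATE PROCEDURE dbo.GetOrderDetails @Ids dbo.IdList READONLY AS ...
        SQLServerDataTable ids = new SQLServerDataTable();
        ids.addColumnMetadata("Id", java.sql.Types.INTEGER);
        for (int id : new int[] {17, 42, 99}) {   // the cached IDs
            ids.addRow(id);
        }

        try (SQLServerConnection conn = (SQLServerConnection) ds.getConnection();
             SQLServerCallableStatement cs = (SQLServerCallableStatement)
                     conn.prepareCall("{call dbo.GetOrderDetails(?)}")) {
            // Pass the ID set as a single table-valued parameter.
            cs.setStructured(1, "dbo.IdList", ids);
            try (ResultSet rs = cs.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}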
For SQL Server, Microsoft generally recommends using stored procedures whenever practical.
Here are a few of the advantages:
http://blog.sqlauthority.com/2007/04/13/sql-server-stored-procedures-advantages-and-best-advantage/
* Execution plan retention and reuse
* Query auto-parameterization
* Encapsulation of business rules and policies
* Application modularization
* Sharing of application logic between applications
* Access to database objects that is both secure and uniform
* Consistent, safe data modification
* Network bandwidth conservation
* Support for automatic execution at system start-up
* Enhanced hardware and software capabilities
* Improved security
* Reduced development cost and increased reliability
* Centralized security, administration, and maintenance for common routines
It's also worth noting that, unlike other RDBMS vendors (like Oracle, for example), MSSQL automatically caches all execution plans:
http://msdn.microsoft.com/en-us/library/ms973918.aspx
However, for the last couple of versions of SQL Server, execution plans are cached for all T-SQL batches, regardless of whether or not they are in a stored procedure.
The best approach depends on how often the Id changes, or how often you want to look it up again.
One technique is to simply store the result in the ASP.NET object cache, using the Cache object (also accessible from HttpRuntime.Cache). For example (from a page):
this.Cache["key"] = "value";
There are many possible variations on this theme.
You can use Memcached to cache values in memory.
As far as I can see, there are some .NET ports.
How frequently does the data change that you'll be querying? To me, this sounds like a perfect scenario for data warehousing, where you flatten the data for quicker retrieval and create the tables exactly as your 'DTO' wants to see the data. This method is different from an indexed view in that it's simply a table with quick seek operations, and it can especially be improved if you set up the indexes properly on the columns that you plan to query.
You can create a global temporary table on the fly, insert the records for your request, and then access this table in your joins in subsequent requests, for reusability.

Fast, scalable string lookup

I have a set of 5 million strings. These are currently stored in a single column MySQL table. My application has to perform lookups and check if a given string is in the set. This can of course be done with a HashSet (in Java). But instead of building a custom solution, I was wondering if there are any existing, widely used, proven solutions that do this? It seems like a common scenario. The solution should be scalable (the set might increase beyond 5 million), have failover (so probably distributed) and perform well under a huge number of requests. Any suggestions?
Update: My app can also query to check if a given set of strings is present in the global (the 5 million one) set.
You can try a trie or a Patricia trie; the second is more memory-efficient. Also, here you can find a comparison of the two data structures (trie, TreeSet), an in-memory database, and their performance.
Try memcached, a high-performance, distributed memory object caching system. You look up values using key/value hashes. Facebook uses memcached, as do many other highly scalable sites. Need to store more strings? Just add more memcached instances to the cluster. Plus, you can use it in a two-tier caching setup where you first query memcached and, on a cache miss, query the full database.
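A minimal sketch of that two-tier lookup with the spymemcached Java client (the key prefix and the lookupInDatabase helper are hypothetical placeholders):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class MemcachedLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical single-node setup; add more addresses to scale out.
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        String key = "member:someString";
        Object cached = client.get(key);
        if (cached == null) {
            // Cache miss: fall back to the full database, then populate the cache.
            boolean inSet = lookupInDatabase("someString");   // hypothetical helper
            client.set(key, 3600, inSet);                     // expire after one hour
        }
        client.shutdown();
    }

    private static boolean lookupInDatabase(String s) {
        return false;   // placeholder for the real MySQL lookup
    }
}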
Have you considered adding column indexing to your MySQL database? Hash, b-tree and r-tree are supported.
MySQL can also be replicated and clustered for high scalability.
While a trie might be the best solution, a binary search on the sorted list of strings should also perform well in terms of run time.
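For example, a minimal sketch of that sorted-array lookup in Java (shown with a tiny sample set; the real set of 5 million strings would be loaded and sorted once):

import java.util.Arrays;

public class SortedLookup {
    public static void main(String[] args) {
        String[] sorted = {"alpha", "bravo", "charlie", "delta"};
        Arrays.sort(sorted);   // sort once up front

        // Binary search: O(log n) per lookup, no extra memory beyond the sorted array.
        boolean present = Arrays.binarySearch(sorted, "charlie") >= 0;
        System.out.println(present);
    }
}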
