When trying to find a list of users whose username starts with some text, for example:
g.V().hasLabel('user').has('username', startingWith('m')).limit(12).values('username'); //4 Seconds
g.V().hasLabel('user').has('username', startingWith('madyan')).limit(12).values('username'); //2 Minutes
The more letters we search for, the slower it gets. Why is that?
And how can we optimise the query? Two minutes is just far too slow.
Neptune maintains multiple indices for exact-match queries but does not include a native full-text search index. So any query that uses the text predicate steps (startingWith(), endingWith(), containing(), etc.) performs a full or partial scan of all the objects/properties being searched. The longer prefix is slower here because it matches fewer usernames, so the scan has to examine far more values before it can satisfy limit(12).
To make this more performant, Neptune has a Full-Text Search feature (1) that leverages an external OpenSearch cluster. You can then use the index maintained in OpenSearch through additional steps in Gremlin (or SPARQL) in Neptune to accomplish this.
(1) https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html
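For example, once the FTS integration is in place, a prefix search can be routed to the OpenSearch index instead of being scanned in Neptune. A minimal gremlinpython sketch (the Neptune and OpenSearch endpoints are placeholders):
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# 'Neptune#fts madyan*' sends the predicate to the OpenSearch index
# (prefix match) rather than scanning property values in Neptune
names = (g.withSideEffect('Neptune#fts.endpoint',
                          'https://your-opensearch-endpoint')
          .V().hasLabel('user')
          .has('username', 'Neptune#fts madyan*')
          .limit(12).values('username')
          .toList())
print(names)
conn.close()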
I am a bit confused about whether this is possible in DynamoDB.
I will give an example in SQL, explain how that query could be optimized, and then try to explain why I am confused about how to model and access the same data in DynamoDB.
This is not company code. Just an example I made up based on pcpartpicker filter.
SELECT * FROM BUILDS
WHERE CPU='Intel' AND OVERCLOCKED='true'
AND Price < 3000
AND GPU='GeForce RTX 3060'
AND ...
From my understanding, SQL will first scan the BUILDS table and filter out all the builds where the CPU is Intel. From this subset, it then applies the next WHERE condition to filter OVERCLOCKED = 'true', and so on. Basically, each additional WHERE clause has a smaller number of rows to filter.
One thing we can do to speed up this query is to create an index on these columns. The main performance gain comes from avoiding the initial scan of the whole table for the first clause the database evaluates. So in the example above, instead of scanning the whole table to find builds that use Intel, the database can retrieve them quickly because the column is indexed.
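As a quick illustration of that effect, here is a toy sketch using SQLite (table and index names made up), where the query plan flips from a full scan to an index search once the index exists:
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE builds (cpu TEXT, overclocked TEXT, price REAL, gpu TEXT)")

# without an index: the plan is a full table scan (SCAN builds)
print(con.execute("EXPLAIN QUERY PLAN SELECT * FROM builds WHERE cpu = 'Intel'").fetchall())

con.execute("CREATE INDEX idx_builds_cpu ON builds (cpu)")

# with the index: the planner searches idx_builds_cpu instead of scanning
print(con.execute("EXPLAIN QUERY PLAN SELECT * FROM builds WHERE cpu = 'Intel'").fetchall())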
How would you model this data in DynamoDB? I know you can create a bunch of secondary indexes, but instead of letting the engine apply each WHERE clause and pass the result along to the next filter, it seems like you would have to do all of this yourself. For example, we would need to use our secondary indexes to find all the builds that use Intel, are overclocked, cost less than 3000, and use a specific GPU, and then we would need to find the intersection ourselves, as in the sketch below. Is there a better way to map out this access pattern? I am having a hard time figuring out if this is even possible.
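Something like this is what I imagine I would have to write (a hypothetical boto3 sketch; the table name, index names, and key attribute are all made up):
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Builds')  # hypothetical table

# query each secondary index separately...
by_cpu = table.query(
    IndexName='cpu-index',  # hypothetical GSI
    KeyConditionExpression=Key('CPU').eq('Intel'),
)['Items']
by_gpu = table.query(
    IndexName='gpu-index',  # hypothetical GSI
    KeyConditionExpression=Key('GPU').eq('GeForce RTX 3060'),
)['Items']

# ...then intersect the result sets myself on the client
cpu_ids = {item['BuildID'] for item in by_cpu}  # hypothetical key attribute
matches = [item for item in by_gpu
           if item['BuildID'] in cpu_ids
           and item['Price'] < 3000
           and item['Overclocked'] == 'true']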
EDIT:
I know I could also just use a normal filter, but that seems like it would be pretty expensive, since it basically brute-force searches through the table, similar to the SQL solution without indexing (see the sketch after this edit).
To see what I mean, here is the pcpartpicker page: https://pcpartpicker.com/builds/
People can select multiple filters at once, which makes designing for access patterns even harder.
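To be concrete about the filter approach (again a hypothetical boto3 sketch): a Scan with a FilterExpression still reads, and bills for, every item in the table; the filter is only applied after the read:
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Builds')  # hypothetical table

# RCUs are charged for the items scanned, not the items returned
resp = table.scan(
    FilterExpression=Attr('CPU').eq('Intel')
                     & Attr('Overclocked').eq('true')
                     & Attr('Price').lt(3000)
                     & Attr('GPU').eq('GeForce RTX 3060')
)
builds = resp['Items']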
I'd highly recommend going through the various AWS presentations on YouTube...
In particular, here's a link to the "The Iron Triangle of Purpose - PIE Theorem" chapter of the AWS re:Invent 2018 presentation "Building with AWS Databases: Match Your Workload to the Right Database (DAT301)".
DynamoDB provides IE - Infinite Scale and Efficiency.
But you need P - Pattern Flexibility.
You'll need to decide if you need PI or PE.
Previously my JanusGraph was working fine with nearly 1,000 elements. I grew the graph to 100k elements and now some queries are very slow. I'm using Cassandra as the storage backend and Elasticsearch as the index backend.
For example
g.V().hasLabel('Person').union(count(), range(0, 15)) takes 1.3 minutes. I just want to get 15 "Person" vertices. It should be fast.
g.E().count() takes 1.4 minutes. Why is it so slow? I only have 200k nodes and 400k edges.
I also have composite indexes for my "Person" node type, covering all of its properties.
Thanks for any advice.
Counting vertices with a specific label is a frequent use case that JanusGraph can only address by running a full table scan, so for larger graphs counting becomes unbearably slow. Because the use case is so common, an issue was created for it.
If you use JanusGraph with an indexing backend for mixed indexes, it is possible to run a so-called direct index query that can return the total number of index hits much faster. This does not work on label filters, though, only on indexed properties; see the sketch below.
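For example, a direct index query can be submitted as a Gremlin-Groovy script from Python (a minimal sketch; the server address and the mixed index name 'personByName' are assumptions, and the index must cover the 'name' property):
from gremlin_python.driver import client

c = client.Client('ws://localhost:8182/gremlin', 'g')

# vertexTotals() asks the index backend (Elasticsearch here) for the
# hit count directly instead of iterating over the results
script = 'graph.indexQuery("personByName", "v.name:john").vertexTotals()'
total = c.submit(script).all().result()
print(total)
c.close()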
I have a flat table with around 30 attributes in DynamoDB. I would like to expose an API for my end users/applications to query on an arbitrary combination of those attributes.
This is trivial to do in a typical RDBMS.
How can we do this in DynamoDB? What kind of modelling techniques and/or key condition expressions can we use to achieve this?
Multi-faceted search like you describe can be challenging in DynamoDB. It can certainly be done, but you may be fighting the tool depending on your specific access patterns. Search in DynamoDB is supported through query (fast and cheap) and scan (slower and expensive) operations. You may want to take some time to read the docs to understand how each works, and why it's critical to structure your data to support your access patterns.
One option is to use Elasticsearch. DynamoDB Streams can be used to keep the Elasticsearch index updated whenever an item changes in DynamoDB. There are even AWS docs on this particular setup.
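The shape of that setup is roughly the following: a Lambda function subscribed to the table's stream mirrors every write into the search index. This is a hypothetical sketch (the index name, key attribute, and endpoint are made up, and a production version would need signed requests and error handling):
from opensearchpy import OpenSearch

es = OpenSearch(hosts=[{'host': 'your-search-domain-endpoint', 'port': 443}],
                use_ssl=True)

def handler(event, context):
    # invoked by the table's DynamoDB Stream
    for record in event['Records']:
        doc_id = record['dynamodb']['Keys']['PK']['S']  # hypothetical key
        if record['eventName'] == 'REMOVE':
            es.delete(index='builds', id=doc_id, ignore=[404])
        else:  # INSERT or MODIFY
            image = record['dynamodb']['NewImage']
            # naive unmarshalling of DynamoDB's attribute-value format
            doc = {k: list(v.values())[0] for k, v in image.items()}
            es.index(index='builds', id=doc_id, body=doc)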
I have a very large data set, close to 500 million edges, almost all of which need to be traversed. I'm trying to parallelize these traversals by paginating on IDs, which are MD5 hashes. I tried queries like the following:
g.E().hasLabel('foo').has(id, TextP.startingWith('AAA')) for page 1
g.E().hasLabel('foo').has(id, TextP.startingWith('AAB')) for page 2
But each query seems to be doing a full scan and not just a subset. How do you recommend I go about pagination?
I suggest that you run the profile() step on your queries to see how much is actually being traversed.
Using a startingWith predicate on id doesn't seem like an optimized solution to me, since ids are probably backed by a hash index, not a range index.
I would try prefixing on some other string property, or even add a random [1..n] 'replica' property and filter using .has('replica', i) to get the best performance, especially on such a large graph. A sketch follows.
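Here is what the replica-bucket idea could look like with gremlinpython (the bucket count, edge label, and endpoint are assumptions):
import random
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

NUM_BUCKETS = 16  # pick n to match the parallelism you want

conn = DriverRemoteConnection('wss://your-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# at write time, tag each edge with a random bucket, e.g.:
#   g.addE('foo').from_(v1).to(v2)
#    .property('replica', random.randrange(NUM_BUCKETS)).iterate()

# at read time, worker i traverses only its own bucket
# (in practice you would stream or page this rather than list everything)
i = 3
edges = g.E().hasLabel('foo').has('replica', i).toList()

# profile() shows how many elements a given filter actually touches
print(g.E().hasLabel('foo').has('replica', i).profile().next())
conn.close()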
I'm trying to use Cosmos DB and I'm having some trouble making a simple count in a collection.
My collection schema is below, and I have 80,000 documents in this collection.
{
    "_id" : ObjectId("5aca8ea670ed86102488d39d"),
    "UserID" : "5ac161d742092040783a4ee1",
    "ReferenceID" : 87396,
    "ReferenceDate" : ISODate("2018-04-08T21:50:30.167Z"),
    "ElapsedTime" : 1694,
    "CreatedDate" : ISODate("2018-04-08T21:50:30.168Z")
}
If I run the command below to count all documents in the collection, I get the result very quickly:
db.Tests.count()
But when I run the same command for a specific user, I get the message "Request rate is large".
db.Tests.find({UserID:"5ac161d742092040783a4ee1"}).count()
In the Cosmos DB documentation I found this scenario, and the suggestion is to increase RUs. Currently I have 400 RU/s; when I increase to 10,000 RU/s I can run the command without errors, but it takes 5 seconds.
I already tried to create the index explicitly, but it seems Cosmos DB doesn't use the index for counts.
I do not think it is reasonable to have to pay for 10,000 RU/s just to run a simple count on a collection with approximately 100,000 documents, especially when it still takes about 5 seconds.
Count-by-filter queries DO use indexes if they are available.
If you try a count by filter on a non-indexed column, the query will not time out but fail outright. Try it. You should get an error along the lines of:
{"Errors":["An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request."]}
So definitely add a suitable index on UserID.
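For example, with pymongo against the Mongo API endpoint (a minimal sketch; the connection string and database name are placeholders):
from pymongo import MongoClient

# placeholder connection string; use your Cosmos DB Mongo API endpoint
client = MongoClient("mongodb://<account>:<key>@<account>.documents.azure.com:10255/?ssl=true")
db = client["testdb"]  # hypothetical database name

# single-field index on UserID so the count-by-filter can be answered
# with an index lookup instead of a scan
db.Tests.create_index("UserID")
print(db.Tests.count_documents({"UserID": "5ac161d742092040783a4ee1"}))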
If you don't have index coverage and don't get the above error, then you have probably set the enableScanInQuery flag. This is almost always a bad idea, and a full scan does not scale: it consumes increasingly large amounts of RU as your dataset grows. So make sure it is off, and index instead.
When you DO have an index on the selected column, your query should run. You can verify that the index is actually being used by sending the x-ms-documentdb-populatequerymetrics header, which should return confirmation with indexLookupTimeInMs and indexUtilizationRatio fields. Example output:
"totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14;indexLookupTimeInMs=0.11;documentLoadTimeInMs=0.00;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;outputDocumentSize=0;writeOutputTimeInMs=0.01;indexUtilizationRatio=0.00"
It also gives you some insight into where the effort went, if you feel the RU charge is too large.
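If you happen to be on the core (SQL) API, the Python SDK can request these metrics for you; a hypothetical sketch (account details and names are placeholders):
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
container = client.get_database_client("testdb").get_container_client("Tests")

items = list(container.query_items(
    query="SELECT VALUE COUNT(1) FROM c WHERE c.UserID = @uid",
    parameters=[{"name": "@uid", "value": "5ac161d742092040783a4ee1"}],
    enable_cross_partition_query=True,
    populate_query_metrics=True,
))
print(items)
# the raw metrics string comes back in the response headers
print(container.client_connection.last_response_headers.get(
    "x-ms-documentdb-query-metrics"))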
If the index lookup time itself is too high, consider whether your index is selective enough and whether the index settings are suitable. Look at your UserID values and their distribution and adjust the index accordingly.
Another wild guess to consider: check whether the API you are using defers executing find(..) until it knows that count() is really what you are after. It is unclear which API you are using. If it turns out it fetches all matching documents to the client side before counting, that would explain the unexpectedly high RU cost, especially if there are many matching documents or the documents are large. Check the API documentation.
I also suggest executing the same query directly in the Azure Portal to compare the RU cost and verify whether or not the issue is client-related.
I think it just doesn't work.
The index seems to be used when selecting the documents to be counted, but the count is then done by reading each document, effectively consuming a lot of RU.
This query is cheap and fast:
db.Tests.count({ UserID: { '$eq': '5ac161d742092040783a4ee1' }})
but this one is slow and expensive:
db.Tests.count({ ReferenceID: { '$gt': 10 }})
even though this query is fast:
db.Tests.find({ ReferenceID: { '$gt': 10 }}).sort({ ReferenceID: 1 })
I also found this: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/36142468-make-count-aware-of-indexes. Note the status: "We have started work on this feature. Will update here when this becomes generally available."
Pretty disappointing, to be honest, especially since this limitation hasn't been addressed for almost 2 years. Note: I am not an expert in this matter and I'd love to be proven wrong, since I also need this feature.
BTW: I noticed that simple indexes seem to be created automatically for each individual field, so there is no need to create them manually.