Are built-in index based ancestor queries efficient? - google-cloud-datastore

The indexes doc at https://cloud.google.com/datastore/docs/concepts/indexes says that built-in single property indexes can support
Queries using only ancestor and equality filters
Queries using only inequality filters (which are limited to a single property)
Since the built-in index for the property is sorted by the property value, I understand how it supports a single inequality filter. However, how is it able to support the equality filter with an ancestor query? Say I have a million rows for the same property value, but the given ancestor condition only matches 100 rows within those million rows; would it have to scan all the million rows to find the 100 matching rows? I don't think that's the case, as somewhere I read that Cloud Datastore scales with the number of rows in the result set and not the number of rows in the database. So, unless the single property index is internally a multi-column index with the first column as the property and the second column as the entity key, I don't see how these ancestor + equality queries can be efficiently supported with built-in single property indexes.

Cloud Datastore built-in indexes are always split into a prefix and a postfix at query time. The prefix portion is the part that remains the same (e.g. equalities or ancestors); the postfix portion is the part that changes (sort order).
Built-in indexes are laid out as:
Kind, PropertyName, PropertyValue, Key
For example, a query: FROM MyKind WHERE A > 1
Would divide the prefix/postfix as:
MyKind,A | range<1, inf>
In the case you're asking about (ancestor with equality), FROM MyKind WHERE __key__ HAS ANCESTOR Key('MyAncestor', 1) AND A = 1 the first part of the prefix is easy:
MyKind,A,1
To understand the ancestor piece, we have to consider that Datastore keys are a hierarchy. In the case of MyKind, the keys might look like: (MyAncestor, 1, MyKind, 345).
This means we can make the prefix for an ancestor + equality query as:
MyKind,A,1,(MyAncestor, 1)
The postfix would then just be all the keys that have (MyAncestor,1) as a prefix and A=1.
This is why you can have an equality with an ancestor using the built-in indexes, but not an inequality with an ancestor.
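As a rough illustration, the prefix scan described above can be sketched in Python. This is a toy model, not the real storage format: the rows, kinds, and key paths are all invented, and the point is only that rows sharing a (kind, property, value, ancestor) prefix are contiguous, so the scan touches only matching rows.

```python
import bisect

# Toy built-in index rows: (kind, property_name, property_value, key_path).
# The key_path carries the full ancestor path, so entities under the same
# ancestor sort next to each other within each property value.
rows = sorted([
    ("MyKind", "A", 1, ("MyAncestor", 1, "MyKind", 345)),
    ("MyKind", "A", 1, ("MyAncestor", 1, "MyKind", 900)),
    ("MyKind", "A", 1, ("MyAncestor", 2, "MyKind", 17)),
    ("MyKind", "A", 2, ("MyAncestor", 1, "MyKind", 345)),
])

def ancestor_equality_scan(kind, prop, value, ancestor):
    """Return keys matching kind, prop=value, and the ancestor prefix,
    scanning only the contiguous range whose prefix matches."""
    lo = bisect.bisect_left(rows, (kind, prop, value, ancestor))
    out = []
    for row in rows[lo:]:
        if row[:3] != (kind, prop, value) or row[3][:len(ancestor)] != ancestor:
            break  # left the prefix range; nothing further can match
        out.append(row[3])
    return out

print(ancestor_equality_scan("MyKind", "A", 1, ("MyAncestor", 1)))
# only the two keys under (MyAncestor, 1) with A = 1 are visited
```

The non-matching rows (the other ancestor, the other property value) are never touched, which is why the cost tracks the result set rather than the table size.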
If you're interested, the video Google I/O 2010 - Next gen queries dives into this in depth.

According to this documentation "The rows of an index table are sorted first by ancestor and then by property values, in the order specified in the index definition."


Tinkerpop Gremlin is it better to query with hasId or to search by property values

Using Tinkerpop Gremlin (Neptune DB), is there a preferred/"faster" way to query?
For example, let's say I have a graph containing the node:
label: Student
id: 'student/12345'
studentId: '12345'
name: 'Bob'
Is there a preferred query? (for this example let's say we know the field 'studentId' value, which is also part of the id)
g.V().filter('studentId', '12345')
vs
g.V().filter(hasId(TextP.containing('12345')))
or using "has"/"hasId" vs "filter"?
g.V().has('studentId', '12345')
vs
g.V().hasId(TextP.containing('12345'))
So there seem to be two questions here: one about filter() vs has() and the other about using the vertex id versus a property.
The answer to the first question is going to depend on the underlying database implementation and what it has or has not optimized. In general, and in Neptune, I would suggest using the g.V().has('studentId', '12345') pattern to filter on a property, as it is optimized and easier to read.
The answer to the second question also depends on the database implementation, as not all allow setting the vertex ids. Other databases may vary, but in Neptune setting ids is allowed, and a direct lookup by id (e.g. g.V('12345') or g.V().hasId('12345')) is the fastest way to look something up, as it is a single index lookup. One thing to note is that in Neptune vertex/edge id values need to be globally unique, so you need to ensure that you will only have one vertex or edge with a specific id.
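As a rough analogy (invented data, not actual Neptune internals), the difference between a direct id lookup and an unindexed property filter can be sketched as a dict lookup versus a full scan:

```python
# Toy graph store: vertex id -> properties. Names and ids are illustrative.
vertices_by_id = {
    "student/12345": {"label": "Student", "studentId": "12345", "name": "Bob"},
    "student/67890": {"label": "Student", "studentId": "67890", "name": "Ann"},
}

def lookup_by_id(vid):
    # analogous to g.V('student/12345'): a single index hit
    return vertices_by_id.get(vid)

def filter_by_property(key, value):
    # analogous to an unindexed property filter: touches every vertex
    return [v for v in vertices_by_id.values() if v.get(key) == value]

assert lookup_by_id("student/12345")["name"] == "Bob"
assert filter_by_property("studentId", "12345")[0]["name"] == "Bob"
```

Both return the same vertex, but the id lookup does constant work while the property filter grows with the number of vertices unless the property is indexed.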

Querying on Global Secondary indexes with a usage of contains operator

I've been reading the DynamoDB docs and was unable to understand whether it makes sense to query a Global Secondary Index with the 'contains' operator.
My problem is as follows: my dynamoDB document has a list of embedded objects, every object has a 'code' field which is unique:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
]
}
I want to be able to get all documents that contain entities with entity.code = X.
For this purpose I'm considering adding a Global Secondary Index that would contain all entity.codes that are present in current db document separated by a comma. So the example above would look like:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
],
"entitiesGlobalSecondaryIndex":"entityCode1,entityCode2"
}
And then I would like to apply a filter expression on entitiesGlobalSecondaryIndex, something like: entitiesGlobalSecondaryIndex contains entityCode1.
Would this be efficient, or does using a global secondary index not make sense in this way, so that DynamoDB will simply check the condition against every document, which is similar to a scan?
Any help is very appreciated,
Thanks
The contains operator of a query cannot be run on a partition key. In order for a query to use any sort of operator (contains, begins_with, >, <, etc.) you must have a range attribute, aka your sort key.
You can very well set up a GSI with some value as your PK and this code as your SK. However, GSIs are replications of the table - there is a slight potential for the data in a GSI to lag behind that of the master copy. If you don't run this query against the GSI very often, then you're probably safe from that.
However, if you are trying to do this to the entire table at once, then it's no better than a scan.
If what you need is a specific Code to return all its documents at once, then you could do a GSI with that as the PK. If you add a date field as the SK of this GSI it would even be time sorted. If you query against that code in that index, you'll get every single one of them.
Since you may have multiple codes, if there aren't too many per document, you could maybe use a Sparse Index - if you have an entity with code "AAAA" then you also have an attribute named AAAA (or AAAAflag or something). It is always null/does not exist unless the entity list contains that code. If you create a GSI on this AAAAflag attribute, it will only contain documents that have that entity code, and ignore all documents where the attribute does not exist. This may work for you if you can also provide a good PK on this to keep the numbers well partitioned and if you don't have too many codes.
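The sparse-index idea can be sketched in a few lines of Python. This is a toy model, not the DynamoDB API: the table items and the "AAAA" attribute name are invented, and the list comprehension stands in for DynamoDB's rule that an item only appears in a GSI when the GSI key attribute exists on it.

```python
# Toy table: only documents containing code "AAAA" carry the flag attribute.
table = [
    {"pk": "doc1", "entities": ["AAAA", "BBBB"], "AAAA": "1"},
    {"pk": "doc2", "entities": ["BBBB"]},             # no AAAA attribute
    {"pk": "doc3", "entities": ["AAAA"], "AAAA": "1"},
]

# DynamoDB projects an item into a GSI only if the GSI key attribute exists,
# so the index is "sparse": it holds just the flagged documents.
sparse_gsi = [item for item in table if "AAAA" in item]

assert [item["pk"] for item in sparse_gsi] == ["doc1", "doc3"]
```

Querying that index then returns only the flagged documents, without ever reading doc2.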
Filter expressions, by the way, are different from all of the above. Filter expressions are run on the data that would be returned, after it is already read out of the table. This is useful if you have a multi-access-pattern setup but don't want a particular call to get all the documents associated with a particular PK - in the interest of keeping the data your code is working with concise. A query with a filter expression still reads everything that query matches, but only presents what makes it past the filter.
If you are only querying against a particular PK at any given time and you want to know if it contains any entities of x, then a filter expression would work perfectly. Of course, this is only per PK and not for your entire table.
If all you need is numbers, then you could do a count attribute on the document, or a meta document on that partition that contains these values and could be queried directly.
Lastly, and I have no idea if this would work or not: if your entities attribute is a map type you might very well be able to filter against the entity code - and maybe even with entities.code.contains(value) if it were an SK - but I do not know if this is possible.

is it good idea to use a binary attribute for GSI indexing in dynamo DB?

I have one attribute in my DynamoDB table which will take binary values success and failure.
Can I do GSI indexing on this attribute if I have to fetch/query either success or failure records from this table?
Or should I make two different tables for the success and failure scenarios?
If I should not do indexing on a binary attribute:
What are the problems with GSI indexing of a binary attribute?
How will it impact the performance of query operations?
It sounds like you perhaps mean boolean (true/false) rather than binary. You cannot create a GSI on a boolean attribute in DynamoDB, but you can on a string, number or binary attribute (which is different from boolean), so you can consider 1 / 0 or “accept” / “fail” for your logical boolean.
You might consider making this a sparse index if you only want to query one side of your index. So if you only want to query when there is a true (or “accept” or 1 or anything really) then when it is not true, delete the attribute rather than set it to “failure” or 0 etc. This makes queries far more performant as the index is smaller, but the limitation is you can no longer query the “failure” / false / 0 cases.
To answer your questions:
1) you can’t create an index on a boolean, use a string or number (or binary, but probably you want string or number)
2) if you only need to query one side of the boolean (e.g. “accept” but never “failure”) you can improve the performance by creating a sparse index
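Putting both points together, a minimal sketch (invented attribute and item names, plain Python standing in for the DynamoDB behavior) of mapping the boolean onto a string and keeping the index sparse:

```python
def gsi_attrs(success: bool) -> dict:
    # Map the boolean to a string attribute, and set it only for the side
    # you query - items without the attribute never enter the sparse GSI.
    return {"status": "success"} if success else {}

items = [
    {"pk": "job1", **gsi_attrs(True)},
    {"pk": "job2", **gsi_attrs(False)},  # no status attribute at all
    {"pk": "job3", **gsi_attrs(True)},
]

# Only items carrying the GSI key attribute are projected into the index.
gsi = [item["pk"] for item in items if "status" in item]
assert gsi == ["job1", "job3"]
```

The failure items simply never appear in the index, so queries for successes stay small and cheap, at the cost of not being able to query failures through this index.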

Does composite index also create normal indexes?

I have a requirement where I need to filter by propA and filter and sort by propB, but never have to query by just propA or propB alone. I configured propA and propB to not be indexed and created a compound index on both, but that didn't work.
As per App Engine DataStore - Compound Indexes - datastore-indexes - not working
a composite index also requires specifying the component props to be indexed. Does that mean, internally there will be 5 indexes, one for the compound index and 2 each (asc/desc) for the two props? I am trying to understand the storage requirements of a compound index.
Yes, the individual properties propA and propB have to be indexed as well.
But no, you don't have to explicitly create (asc and desc) indexes for them; just let the datastore automatically create the built-in indexes (one per property, not 2) by simply not declaring the properties "not indexed". From Indexes:
Built-in indexes
By default, a Datastore mode database automatically predefines an
index for each property of each entity kind. These single property
indexes are suitable for simple types of queries.
So there will be 3 indexes in your case, 2 built-in and 1 composite.
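For reference, in the Python/App Engine flavor the composite index would be declared in index.yaml roughly like this (a sketch only; the kind name, property names and sort direction are assumptions standing in for your actual schema), while the two built-in single-property indexes are created automatically:

```yaml
# Hypothetical index.yaml - adjust kind, names and directions to your schema.
indexes:
- kind: MyKind
  properties:
  - name: propA          # equality filter
  - name: propB          # inequality filter + sort
    direction: desc
```

The Java runtime expresses the same thing in datastore-indexes.xml; either way, only the composite index needs declaring.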

Riak inserting a list and querying a list

I was wondering if there was an efficient way of handling arrays/lists in Riak. Right now I'm storing the whole array as a string and searching the string to find out if an element exists in the array.
ID (key) : int[] (Value)
And also, how do I write a map/reduce query to give all the keys for which the value array contains an element?
For example:
1 : 2,3,4
2 : 2,5
How would I write an M/R query to give me all the keys for which the value contains 2? The result is 1,2 in this case.
Any help is appreciated
If you are searching for a specific element in the list and are using the LevelDB backend, you could create a secondary index that will contain the values of the array. Secondary indexes in Riak may contain multiple values and can be searched for equality, which should allow you to search for single elements in the array without having to resort to MapReduce.
If you need to make more complicated queries based on either several elements in the list or other parameters, you could retrieve a subset of records based on the secondary index and then process them further on the client side or perhaps even through a MapReduce job.
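The multi-valued secondary index idea can be sketched as follows. This is a toy model in Python, not the Riak client API: the data is the asker's example, and the dict-of-sets stands in for a 2i index where each key contributes one index entry per list element.

```python
from collections import defaultdict

# The asker's example data: key -> stored list.
data = {1: [2, 3, 4], 2: [2, 5]}

# Build a multi-valued index: one index entry per element of each list,
# mirroring how a Riak object can carry several values under one 2i index.
elements_index = defaultdict(set)
for key, values in data.items():
    for v in values:
        elements_index[v].add(key)

# An equality query on the index returns every key whose list contains 2,
# with no MapReduce job needed.
assert sorted(elements_index[2]) == [1, 2]
```

A range or equality 2i query then replaces both the string search and the MapReduce job for the simple "contains element" case.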
