How does GAE datastore index null values - google-cloud-datastore

I'm concerned about read performance, and I want to know whether storing an indexed field as null is faster than giving it a value.
I have lots of items with a status field. The status can be "pending", "invalid", "banned", etc.
My typical query is for status "ok" (or null). Since null fields are not saved to the datastore, replacing a "useless" default value with null is already a win: it uses less disk space.
But I was wondering: since Datastore is NoSQL, it doesn't know my data structure, and it doesn't know that the status column is missing. So how does it handle a status = null query?
Does it have to check every column of each row looking for mine, or is there some smarter mechanism?
For example, does it write an index entry (null = entity key) when we explicitly pass a column as null? (If so, does Objectify respect that and keep null fields in the property list it passes to the native API?)
And mainly, which query is more efficient?

The low-level API (and Objectify) stores and indexes nulls if you specify that a field/property should be indexed. For Objectify, you can specify @IgnoreSave(IfNull.class) or @Unindex(IfNull.class) if you want to alter this behavior. You are probably confusing this with documentation for other data access APIs.
Since GAE only allows you to query for indexed fields, your question is really: Is it better to index nulls and query for them, or to query for everything and filter out non-null values?
This is purely a question of sparsity. If the overwhelming majority of your records contain null values, then you're probably better off querying for everything and filtering out the ones you don't want manually. A handful of extra entity reads are probably cheaper than updating and storing an extra index. On the other hand, if null records are a small percentage of your data, then you will certainly want the index.
This indexing dilemma is not unique to GAE. All databases present this question with respect to low-cardinality fields; it's just that they'll do the table scan (testing and skipping rows) for you.
If you really want to fine-tune this behavior, read Objectify's documentation on Partial Indexes.
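To illustrate, here is a minimal Objectify sketch of both behaviors. The Item entity, its fields, and the IfNotNull partial-index condition are example names of my own, not something from the question:
import java.util.List;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;
import com.googlecode.objectify.condition.IfNotNull;
import static com.googlecode.objectify.ObjectifyService.ofy;

@Entity
public class Item {
    @Id Long id;

    // Indexed unconditionally: null values get index entries too,
    // so a query that filters on null can find them.
    @Index String status;

    // Partial index: only non-null values are indexed, so nulls cost no
    // index writes or storage, but can no longer be queried for.
    @Index(IfNotNull.class) String category;

    // Load entities whose status is null (requires nulls to be indexed).
    public static List<Item> loadItemsWithNullStatus() {
        return ofy().load().type(Item.class).filter("status", null).list();
    }
}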

null is also treated as a value in Datastore, and there will be entries for null values in indexes. The Datastore documentation says: "Datastore distinguishes between an entity that does not possess a property and one that possesses the property with a null value."
Datastore will never check all columns or all records. If you have the property indexed, it will fetch records from the index only. If it is not indexed, you cannot query by that property at all.
In terms of query performance, it should be the same, but you can always profile and check.
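To make that concrete, a rough sketch with the low-level Java Datastore API; the "Item" kind and "status" property are made up for illustration:
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

public class NullIndexExample {
    public static PreparedQuery itemsWithNullStatus() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // Store the property explicitly as null; because it is indexed,
        // the index on "status" gets an entry for the null value.
        Entity item = new Entity("Item");
        item.setProperty("status", null);
        ds.put(item);

        // The equality filter on null is answered from that index,
        // not by scanning entities.
        Query q = new Query("Item")
                .setFilter(new FilterPredicate("status", FilterOperator.EQUAL, null));
        return ds.prepare(q);
    }
}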

Related

Querying on Global Secondary Indexes using the contains operator

I've been reading the DynamoDB docs and was unable to work out whether it makes sense to query a Global Secondary Index using the 'contains' operator.
My problem is as follows: my DynamoDB document has a list of embedded objects, and every object has a 'code' field which is unique:
{
    "entities": [
        {"code": "entity1Code", "name": "entity1Name"},
        {"code": "entity2Code", "name": "entity2Name"}
    ]
}
I want to be able to get all documents that contain entities with entity.code = X.
For this purpose I'm considering adding a Global Secondary Index that would contain all entity codes present in the current db document, separated by commas. The example above would then look like:
{
    "entities": [
        {"code": "entity1Code", "name": "entity1Name"},
        {"code": "entity2Code", "name": "entity2Name"}
    ],
    "entitiesGlobalSecondaryIndex": "entityCode1,entityCode2"
}
I would then like to apply a filter expression on entitiesGlobalSecondaryIndex, something like: entitiesGlobalSecondaryIndex contains entityCode1.
Would this be efficient, or does using a Global Secondary Index this way not make sense, so that DynamoDB would simply check the condition against every document, which is effectively a scan?
Any help is very appreciated,
Thanks
The contains operator cannot be used in a query's key condition at all. Key conditions only support equality on the partition key, plus operators such as begins_with, >, <, and BETWEEN on a range attribute, a.k.a. your sort key; contains is only available in filter expressions.
You can very well set up a GSI with some value as your PK and this code as your SK. However, GSIs are replications of the table, so there is a slight potential for the data in a GSI to lag behind the master copy. If you aren't querying this GSI very often, you're probably safe from that.
However, if you are trying to do this across the entire table at once, then it's no better than a scan.
If what you need is for a specific code to return all of its documents at once, then you could create a GSI with that code as the PK. If you add a date field as the SK of this GSI, it would even be time-sorted. If you query against that code in that index, you'll get every single one of them.
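As a sketch of that per-code GSI query with the AWS SDK for Java v2; the table, index, and attribute names are invented, and it assumes each item carries a top-level entityCode attribute that the GSI is keyed on:
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class CodeIndexQuery {
    // GSI with entityCode as partition key and createdAt as sort key,
    // so one code returns all of its documents, newest first.
    public static QueryResponse documentsForCode(DynamoDbClient ddb, String code) {
        QueryRequest request = QueryRequest.builder()
                .tableName("Documents")
                .indexName("entityCode-createdAt-index")
                .keyConditionExpression("entityCode = :code")
                .expressionAttributeValues(Map.of(
                        ":code", AttributeValue.builder().s(code).build()))
                .scanIndexForward(false) // newest first, thanks to the date SK
                .build();
        return ddb.query(request);
    }
}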
Since you may have multiple codes, and if there aren't too many per document, you could use a sparse index: if a document has an entity with code "AAAA", then you also give that document an attribute named AAAA (or AAAAflag or similar). The attribute is always null/absent unless the entities list contains that code. A GSI on this AAAAflag attribute will then only contain documents that have that entity code and ignore documents where the attribute does not exist. This may work for you if you can also provide a good PK to keep the partitions well distributed and if you don't have too many codes.
Filter expressions, by the way, are different from all of the above. Filter expressions run on the data that would be returned, after it has already been read out of the table. This is useful if you have a multi-access-pattern setup but don't want a particular call to return all the documents associated with a particular PK, in the interest of keeping the data your code works with concise. A query with a filter expression still reads everything the query matches, but only returns what makes it past the filter.
If you are only querying against a particular PK at any given time and you want to know whether it contains any entities with code X, then a filter expression would work perfectly. Of course, this is only per PK and not for your entire table.
If all you need is counts, then you could keep a count attribute on the document, or a meta document in that partition that holds these values and can be queried directly.
Lastly, and I have no idea whether this would work or not: if your entities attribute is a map type, you might very well be able to filter against the entity code (maybe even with something like contains(entities.code, :value)), but I do not know if that is possible.
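For the filter-expression route described above, a similar sketch (again with made-up table and attribute names), assuming the codes live in a string or string-set attribute on the item; note the filter runs after the read, so you still pay for every item the key condition matches:
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class FilteredPartitionQuery {
    public static QueryResponse documentsInPartitionWithCode(
            DynamoDbClient ddb, String partitionKey, String code) {
        QueryRequest request = QueryRequest.builder()
                .tableName("Documents")
                // The key condition narrows the read to one partition...
                .keyConditionExpression("pk = :pk")
                // ...and the filter is applied after the items are read.
                .filterExpression("contains(entityCodes, :code)")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s(partitionKey).build(),
                        ":code", AttributeValue.builder().s(code).build()))
                .build();
        return ddb.query(request);
    }
}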

Encode PartitionKey into Document Id?

I have set the partition key of one of my Cosmos DBs to /partition.
For example: we have a Chat document that contains a list of subscribers, and we have ChatMessage documents that contain a text, a reference to the author and some other properties. Both document types have a partition property made up of the type 'chat' and the chat's id.
Chat example:
{
    "id": "955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
    "name": "SO questions",
    "isChat": true,
    "partition": "chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
    "subscribers": [
        ...
    ]
}
We then have Message documents like this:
{
    "id": "4d1c7b8c-bf89-47e0-83e1-a8cf0d71ce5a",
    "authorId": "some guid",
    "isMessage": true,
    "partition": "chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
    "text": "What should I do?"
}
It is now very convenient to return all messages for a specific chat: I just query all documents of the partition chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c with the property isMessage = true. All good...
But if I now want to query my db for a specific message by id, I usually only know the id, not the partition, and therefore have to run a slow cross-partition query. This led me to the question of whether I should encode the partition key into the message id, so I can split the id when querying the db for a faster lookup. I saw that the _rid property of a document looks like a combination of the db id, the collection id and a document-specific id. What I mean by this is (simplified):
Chat.Id = "abc"
Chat.Partition = "chat_abc" //[type]_[chatId]
Message.Id = "chat_abc|123" //[Chat.Partition]|[Message.Id]
Message.Partition = "chat_abc" //[Chat.Partition]
Let's assume that I now want to get the Message document by its id: I just split the id on the | symbol and then query the document with the first part as the partition and the full id as the key.
Does that make sense? Are there better ways to do this? Should I just always pass the partition key of a document along, not just its id? Should I just use the _rid properties instead?
Any experience is highly appreciated!
UPDATE
I have found the following answer here:
Some applications encode partition key as part of the ID, e.g.
partition key would be customer ID, and ID = "customer_id.order_id",
so you can extract the partition key from the ID value.
I have further asked the Cosmos team by email whether this is a recommended pattern and will post an answer in case I get one.
Yes, your proposal to extract partition key from id (via a convention like a prefix/delimiter) makes sense. This is common among applications that have a single key and want to refactor it to use Cosmos DB from a different storage system.
If you're building your application from scratch, you should consider wiring the composite key (partition key + item key ("id")) through your API/application.
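In code, that convention boils down to splitting the id and doing a point read with both values. A sketch with the Cosmos DB Java SDK v4 follows; the Message POJO is a stand-in for the real message document, and the id format is the one proposed in the question:
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.PartitionKey;

public class MessageLookup {

    // Minimal stand-in for the real message document.
    public static class Message {
        public String id;
        public String partition;
        public String text;
    }

    private final CosmosContainer container;

    public MessageLookup(CosmosContainer container) {
        this.container = container;
    }

    // The id is expected to look like "chat_<chatId>|<messageId>": the part
    // before '|' is the partition key, the whole string is the item id.
    public Message findByEncodedId(String encodedId) {
        String partitionKeyValue = encodedId.substring(0, encodedId.indexOf('|'));
        return container
                .readItem(encodedId, new PartitionKey(partitionKeyValue), Message.class)
                .getItem();
    }
}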
First, if you know your data (and index) size will remain within the 10 GB limit and your RU/sec limit is fine, then a fixed, partition-less collection sidesteps this problem entirely. The OP has probably knowingly decided that partitioning is required, but it is an important consideration to note for the general case. If possible, KISS ;)
If partitioning is a must, then AFAIK you cannot avoid the cross-partition query and its overhead unless you know the partition key.
Imho the OP's suggestion of merging the duplicated partition key into the id field is a rather ugly solution, because:
The name id implies it is a unique key; the partition key is not part of it and is not necessary for its uniqueness. Anyone using this key upstream would incur the forced extra cost of a longer key, be blocked from using the simpler Guid type, etc.
It will become a mess should your partitioning key change in the future.
The internal structure of the merged id would not be intuitive without documentation: its parts are not named, and even if they appear to follow a pattern, new devs would not know for sure without finding external documentation to reliably understand what's going on.
Your data model does not require this duplication at the semantic level; it would exist purely for your application's querying convenience, and such hacks belong in your application code, not your data model. Such leaking concerns should be avoided if possible.
Data duplication within the document would unnecessarily increase document size, bandwidth, etc. (which may or may not be notable, depending on scale and usage). In-document duplication is necessary at times, but imho not in this case.
A better design would be to ensure the partition key is always present in the logical context and can be passed along to lookups. If you don't have it available, then maybe you should refactor your application code (not your data design) to explicitly pass the chatId around along with the id where needed, WITHOUT merging them together into some opaque string format.
Also, I don't see a good way to use _rid for this since, if I remember correctly, it does not contain any internal reference to a partition or partition key.
Disclaimer: I don't have any access or deep insight into internal CosmosDB index design or _rid logic on partitioned collections. I may have misunderstood how it works.

Cloud Firestore whereNotEqual

Does Firestore support something like whereNotEqual?
For example, I need to get exactly those documents where the key "xyz" is missing.
In Firebase realtime db, we could get it by calling *.equalTo(null).
Thanks.
Firestore does not support a direct equivalent of !=. The supported query operators are <, <=, ==, >, and >=, so there's no "whereNotEqual".
You can test whether a field exists at all, because all filters and order-bys implicitly create a filter on whether or not the field exists. For example, in the Android SDK:
collection.orderBy("name")
would return only those rows that contain a "name" field.
As with explicit comparisons, there's no way to invert this query to return the rows where a value does not exist.
There are a few workarounds. The most direct replacement is to explicitly store null and then query collection.whereEqualTo("name", null). This is somewhat annoying, though, because if you didn't populate the field from the outset you have to backfill existing data once you want to do this, and if you can't upgrade all your clients you'll need to deploy a function to keep the field populated.
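A small sketch of that first workaround with the Android SDK (the collection, document id, and field name are placeholders): backfill an explicit null, then query for it.
import java.util.HashMap;
import java.util.Map;
import com.google.firebase.firestore.CollectionReference;
import com.google.firebase.firestore.FirebaseFirestore;

public class MissingFieldQuery {
    public static void queryNullName() {
        CollectionReference people = FirebaseFirestore.getInstance().collection("people");

        // Backfill: write an explicit null so the field exists (and is indexed).
        Map<String, Object> update = new HashMap<>();
        update.put("name", null);
        people.document("someDocId").update(update);

        // Documents whose name is explicitly null can now be queried directly.
        people.whereEqualTo("name", null)
              .get()
              .addOnSuccessListener(snapshot ->
                      System.out.println("matches: " + snapshot.size()));
    }
}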
Another possibility is to observe that missing fields usually indicate that a document is only partially assembled, perhaps because it goes through some state machine or is a kind of union of two non-overlapping types. If you explicitly record the state or type as a discriminant, you can query on that rather than on field non-presence. This works really well when there are only two states/types but gets messy if there are many.
Cloud Firestore now supports whereNotEqualTo in database queries.
Keep in mind that if you have more than one field in your query, you may have to create a composite index in Cloud Firestore.
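For completeness, a sketch of the newer operator in the Android SDK (collection and field names are placeholders). Note that it compares against a concrete value, not null, and, like other inequality filters, it only matches documents where the field actually exists:
import com.google.firebase.firestore.FirebaseFirestore;
import com.google.firebase.firestore.Query;

public class NotEqualQuery {
    public static Query nonBannedUsers() {
        // Returns documents whose "status" field exists and is not "banned";
        // documents missing the field are not matched by != queries.
        return FirebaseFirestore.getInstance()
                .collection("users")
                .whereNotEqualTo("status", "banned");
    }
}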

How can I Scan an index in reverse in DynamoDB?

I am currently using DynamoDB and having a problem with scanning. I am able to get paged results in forward order by using ExclusiveStartKey. However, regardless of whether I set ScanIndexForward to true or false, I get results in forward order from my Scan operation. How can I get results in reverse order from a Scan in DynamoDB?
ScanIndexForward is the correct way to get items in descending order by the range key of the table or index you are querying. From the AWS API Reference:
A value that specifies ascending (true) or descending (false) traversal of the index. DynamoDB returns results reflecting the requested order determined by the range key. If the data type is Number, the results are returned in numeric order. For type String, the results are returned in order of ASCII character code values. For type Binary, DynamoDB treats each byte of the binary data as unsigned when it compares binary values.
Based on the docs for Scan, I conclude that there is no way to Scan in reverse. However, I would say that you are not using DynamoDB correctly if you need to do that. When designing a schema for a database like DynamoDB, you should plan the schema around your expected queries so that almost all application queries have a good index. Scans are meant more for sysadmin operations or for feeding into MapReduce or analytics. "A Scan operation always scans the entire table, then filters out values to provide the desired result, essentially adding the extra step of removing data from the result set." (Query and Scan Performance) That can lead to performance problems and other issues.
Using DynamoDB is fundamentally different from working with a traditional relational database and requires a big change in the way you think about using it. You need to decide whether DynamoDB's advantages in storage, performance, reliability and availability are worth accepting its limitations.
As of now, a DynamoDB Scan cannot return sorted results.
You need to use a Query with a new global secondary index (GSI) that has a hash key and a range field. The trick is to use a hash key which is assigned the same value for all items in your table.
I recommend adding a new field to all items, calling it "Status", and setting its value to "OK" or something similar.
Then your query to get all the results sorted would look like this:
{
    TableName: "YourTable",
    IndexName: "Status-YourRange-index",
    KeyConditions: {
        Status: {
            ComparisonOperator: "EQ",
            AttributeValueList: [
                "OK"
            ]
        }
    },
    ScanIndexForward: false
}
The docs for how to write GSI queries are found here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Querying
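For reference, roughly the same query written with the AWS SDK for Java v2, using the newer KeyConditionExpression form instead of the legacy KeyConditions shown above (Status is a DynamoDB reserved word, hence the expression attribute name):
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class SortedGsiQuery {
    public static QueryResponse allItemsSortedDescending(DynamoDbClient ddb) {
        QueryRequest request = QueryRequest.builder()
                .tableName("YourTable")
                .indexName("Status-YourRange-index")
                // Constant hash key: every item carries Status = "OK", so this
                // query sees the whole table, ordered by the index's range key.
                .keyConditionExpression("#s = :ok")
                .expressionAttributeNames(Map.of("#s", "Status"))
                .expressionAttributeValues(Map.of(
                        ":ok", AttributeValue.builder().s("OK").build()))
                .scanIndexForward(false) // descending by the range key
                .build();
        return ddb.query(request);
    }
}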

SQLite - Get a specific row index for a Sorted/Filtered Query

I'm creating a caching system to take data from an SQLite database table using a sorted/filtered query and display it. The tables I'm pulling from can be potentially very large and, of course, I need to minimize impact on memory by only retaining a maximum number of rows in memory at any given time. This is easily done by using LIMIT and OFFSET to load only the records I need and update the cache as needed. Implementing this is trivial. The problem I'm having is determining where the insertion index is for a new record inserted into a particular query so I can update my UI appropriately. Is there an easy way to do this? So far the ideas I've had are:
Dump the entire cache, re-count the Query results (there's no guarantee the new row will be included), refresh the cache and refresh the entire UI. I hope it's obvious why that's not really desirable.
Use my own algorithm to determine whether the new row is included in the current query, whether it is included in the current cached results, and at what index it should be inserted if it's within the current cached scope. The biggest downfall of this approach is its complexity and the risk that my own sorting/filtering algorithm won't match SQLite's.
Of course, what I want is to be able to ask SQLite: Given 'Query A' what is the index of 'Row B', without loading the entire query results. However, so far I haven't been able to find a way to do this.
I don't think it matters, but this is all occurring on an iOS device, using the Objective-C programming language.
More Info
The query and subsequent cache are based on user input. Essentially, the user can re-sort and filter (or search) to alter the results they're seeing. My reluctance to simply recreate the cache on insertions (and edits, actually) is to provide a 'smoother' UI experience.
I should point out that I'm leaning toward option "2" at the moment. I played around with creating my own caching/indexing system by loading all the records in a table and performing the sort/filter in memory using my own algorithms. So much of the code needed to determine whether and/or where a particular record is in the cache is already there, so I'm slightly predisposed to use it. The danger lies in having a cache that doesn't match the underlying query. If I include a record in the cache that the query wouldn't return, I'll be in trouble and probably crash.
You don't need record numbers.
Save the values of the ordered field in the first and last records of the LIMITed query result.
Then you can use these to check whether the new record falls into this range.
In other words, assuming that you order by the Name field, and that the original query was this:
SELECT Name, ...
FROM mytab
WHERE some_conditions
ORDER BY Name
LIMIT x OFFSET y
then try to get at the new record with a similar query:
SELECT 1
FROM mytab
WHERE some_conditions
AND PrimaryKey = LastInsertedValue
AND Name BETWEEN CachedMin AND CachedMax
Similarly, to find out before (or after) which record the new record was inserted, start directly after the inserted record and use a limit of one, like this:
SELECT Name
FROM mytab
WHERE some_conditions
AND Name > MyInsertedName
AND Name BETWEEN CachedMin AND CachedMax
ORDER BY Name
LIMIT 1
This doesn't give you a number; you still have to check where the returned Name is in your cache.
Typically you'd expect a cache to be invalidated when the underlying data changes. I think dropping it and starting over will be your simplest, most maintainable solution, and I would recommend it unless you have a very good reason not to.
You could write another query that just returns the row count (example below) to see whether your cache should be invalidated. That would save recreating the cache when it hasn't changed.
SELECT name,address FROM people WHERE area_code=970;
SELECT COUNT(rowid) FROM people WHERE area_code=970;
The information you'd need from SQLite to know when your cache is invalidated would require rather intimate knowledge of how the query and/or index works. I would say that is fairly high coupling.
Otherwise, you'd want to know where the row lands with respect to the sorting. You would probably key each page on the sorted field, delete anything greater than the inserted/deleted value, and drop everything any time you change the sorting.
Something like the code below would be a start if you were using C++. I realize you aren't doing C++, but hopefully it's evident what I'm trying to do.
#include <set>
#include <string>
#include <vector>

struct Person {
    std::string name;
    std::string addr;
};

struct Page {
    std::string key;                  // value of the sorted field at the start of this page
    std::vector<Person> persons;

    struct Less {
        bool operator()(const Page &lhs, const Page &rhs) const {
            return lhs.key.compare(rhs.key) < 0;
        }
    };
};

typedef std::set<Page, Page::Less> pages_t;
pages_t pages;

bool sql_insert(const Person &person);   // the actual SQLite INSERT, defined elsewhere

void insert(const Person &person) {
    if (sql_insert(person)) {
        // Find the first cached page whose key is >= the inserted name...
        pages_t::iterator drop_cache_start = pages.lower_bound(Page{person.name, {}});
        // ...then drop that page and everything after it.
    }
}
You'd have to do some wrangling to get different data types of key to work nicely, but it's possible.
Theoretically you could leave the pages out of it and only use the objects themselves. The database would no longer "own" the data, though. If you only fill pages from the database, you'll have fewer data-consistency worries.
This may be a bit off topic, but you aren't re-implementing views, are you? A view doesn't cache per se, but it isn't clear whether caching is a requirement of your project.
The solution I came up with is not exactly simple, but it's currently working well. I realized that the index of a record in a query's results is also the count of all the records that precede it. What I needed to do was 'convert' all the ORDER BY clauses in the query into a series of WHERE conditions that return only the preceding records, and then take a count of those records. It's trickier than it sounds (or maybe not... it sounds tricky). The biggest issue I had was making sure the query was, in fact, sorted in a way I could predict. This meant I needed an order column in the order parameters that is based on a column with unique values. So, whenever a user sorts on a column, I append another order parameter on a unique column (I used a "Modified Date Stamp") to break ties.
Creating the WHERE portion of the statement requires more than just tacking on a bunch of ANDs. It's easier to demonstrate. Say you have three order columns: "LastName" ASC, "FirstName" DESC, and "Modified Stamp" ASC (the tie breaker). The WHERE clause would have to look something like this ('?' = record value):
WHERE
    "LastName" < ? OR
    ("LastName" = ? AND "FirstName" > ?) OR
    ("LastName" = ? AND "FirstName" = ? AND "Modified Stamp" < ?)
Each group of WHERE conditions wrapped in parentheses is a tie breaker: if the record values of "LastName" are in fact equal, we must then look at "FirstName", and finally at "Modified Stamp". Obviously, this clause can get really long if you're sorting by a lot of order parameters.
There's still one problem with the above solution. Comparisons involving NULL values never evaluate to true, yet SQLite sorts NULL values first. Therefore, in order to deal with NULL values appropriately you have to add another layer of complication. First, every equality operation (=) must be replaced by IS. Second, every < comparison must be wrapped with an OR ... IS NULL so that NULL values are included on the < side. This turns the above clause into:
WHERE
    ("LastName" < ? OR "LastName" IS NULL) OR
    ("LastName" IS ? AND "FirstName" > ?) OR
    ("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
I then take a count of the RowID using the above WHERE parameter.
It turned out to be easy enough for me to do, mostly because I had already built a set of objects to represent various aspects of my SQL statement, which could be assembled to generate the statement. I can't even imagine trying to manipulate a SQL statement like this any other way.
So far, I've tested using this on several iOS devices with up to 10,000 records in a table and I've had no noticeable performance issues. Of course, it's designed for single record edits/insertions so I don't really need it to be super fast/efficient.
