I have a Datastore (Firestore in Datastore mode) database/entity with the following composite index:
- kind: spider-results
  properties:
  - name: created
    direction: desc
  - name: meta
  - name: score
This was applied two days ago, but the index still doesn't contain any entities.
The entities have more attributes than the three contained in the index. In Python using google-cloud-datastore version 1.7.1 I was trying to fetch entities. To make queries as fast as possible and return only the fields contained in the index, I set the projection argument:
from google.cloud import datastore

client = datastore.Client.from_service_account_json(credentials_path)
query = client.query(kind='spider-results', order=['-created'],
                     projection=['created', 'meta', 'score'])
for entity in query.fetch(eventual=True):
    print(entity.key.name)
However, this yields no result. My understanding is that this is caused by the matching index being empty.
When I remove the projection argument, the query yields results. These contain some relatively large fields, which cause significant data transfer and slow down the entire query.
I have a CosmosDb and a Synapse workspace linked. Almost everything works when I use Synapse to create SQL views over the Cosmos data.
In Cosmos I have one data set with a property that is always a zero. I know it is actually a decimal because it is a price and future data is likely to contain decimal prices.
In Synapse I need to project this data into an SQL view where that column is correctly a decimal(19,4).
I run an OPENROWSET query against the Cosmos data and attempt to specify the type for this property:
select *
from OPENROWSET(
    'CosmosDb',
    'account=myaccount;database=myDatabase;region=theRegion;key=xxxxxxxxxxxxxxx',
    [myCollection])
with (
    [salesPrice] float '$.salesPrice')
as testQuery
I get the error:
Column 'salesPrice' of type 'FLOAT' is not compatible with external data type 'Parquet physical type: INT64', please try with 'BIGINT'.
Obviously a BIGINT here is going to fail as soon as I get a true decimal price.
I think the parquet type is getting set to BIGINT because in Cosmos all the values for this column are zero. I guess more generally it would be the same problem if the Cosmos property was all non-zero integers.
How can I force the type of salesPrice to be a decimal or float?
(I don't want to get side tracked here on float vs decimal for monetary values, I understand the difference; this error happens either way)
UPDATE
This problem manifests itself also in another way without specifying a schema with OPENROWSET.
In a new CosmosDb collection insert a document such as:
{
    "myid" : 1,
    "price" : 0
}
If I wait a minute or so I can query this document from Synapse with:
select *
from OPENROWSET(
    'myCosmosDb',
    'account=myAccount;database=myDatabase;region=myRegion;key=xxxxxxxxxxxxxxxxxxx',
    [myCollection])
as testQuery;
and I get the expected results.
Now add a second document:
{
    "myid" : 1,
    "price" : 1.1
}
and re-run the query and I get the same error:
Column 'price' of type 'FLOAT' is not compatible with external data type 'Parquet physical type: INT64', please try with 'BIGINT'
Is there any way to work around or prevent these kinds of errors?
How about storing the values as strings, so the document looks like this:
{
    "myid" : "1",
    "price" : "1.1"
}
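The idea above can be sketched on the writer side in Python. This is only an illustration under assumptions: the helper name stringify_fields is made up here, and it simply converts the chosen fields to strings before the document is written, so every document carries the same type for that property.

```python
def stringify_fields(doc, fields):
    """Return a copy of doc with the named fields converted to strings.

    Hypothetical helper: it only illustrates storing numeric values
    as strings so that all documents share one type per property.
    """
    out = dict(doc)
    for name in fields:
        if name in out and out[name] is not None:
            out[name] = str(out[name])
    return out


first = stringify_fields({"myid": 1, "price": 0}, ["myid", "price"])
second = stringify_fields({"myid": 1, "price": 1.1}, ["myid", "price"])
print(first)
print(second)
```

The Synapse view would then presumably read the column as a string type and cast it to decimal(19,4); that part is not shown here and is untested.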
I am moving my database from a SQL database to DynamoDB. I currently have a table with these attributes:
tenantId (PartitionKey)
resourceId (RangeKey)
type
role
name
I have the following query at the moment:
Get all the resources belonging to tenant ten that have type t, role r and a name containing n. Any of type, role and name may be null, in which case it is not used as a filter.
Using filters it is possible to make this query in DynamoDB, but after reading the article https://aws.amazon.com/blogs/database/querying-on-multiple-attributes-in-amazon-dynamodb/ I realized it may be an expensive query, because DynamoDB retrieves the data first and then applies the filter server-side. That page suggests creating a GSI keyed on the following concatenated value:
tenantId-type-role-name
With this index I can easily filter on tenantId, type, role and name together. But if I only need to filter on tenantId, type and name, how should I query the GSI to get all records for tenant ten with type t and a name containing n, with no restriction on role? (The contains operator seems to be supported only in filters.)
I am wondering if I need to create a GSI for each combination, something like:
tenantId-type
tenantId-role
tenantId-name
tenantId-type-role
...
Thanks in advance for your help
Before you build GSIs to make your querying simpler, think about storing your data in a different format.
For example how many resources do you expect per tenant? Could you store your data as such:
{
    tenant: 123, //(partition)
    resources: [
        { type: 'type1', role: 'role1', name: 'somename1'},
        { type: 'type2', role: 'role2', name: 'somename2'},
        { type: 'type3', role: 'role3', name: 'somename3'}
    ]
}
In the format above, reads will be fast and will scale. You can then apply your contains logic as a filter in code. A DynamoDB item can be up to 400 KB in size, so you could probably store several thousand resources per item in this format.
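Filtering in code could look roughly like the sketch below. It assumes the tenant item has already been read and deserialized into a plain dict shaped like the example above; the function name filter_resources and its parameters are made up for illustration. A criterion left as None is not applied, matching the "null means no filter" rule from the question.

```python
def filter_resources(resources, res_type=None, role=None, name_contains=None):
    """Filter a tenant's resource list in application code.

    Hypothetical helper: any criterion left as None is ignored.
    """
    matches = []
    for res in resources:
        if res_type is not None and res.get("type") != res_type:
            continue
        if role is not None and res.get("role") != role:
            continue
        if name_contains is not None and name_contains not in res.get("name", ""):
            continue
        matches.append(res)
    return matches


# Stand-in for the item read from DynamoDB by partition key.
record = {
    "tenant": 123,
    "resources": [
        {"type": "type1", "role": "role1", "name": "somename1"},
        {"type": "type2", "role": "role2", "name": "somename2"},
        {"type": "type3", "role": "role3", "name": "somename3"},
    ],
}

# Tenant + type + name, with no restriction on role.
print(filter_resources(record["resources"], res_type="type2", name_contains="name"))
```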
Also note each GSI has its own read/write unit usage that is used up when you insert into the table. If you do the GSI approach and write a lot to that table you'll have a surprisingly high write usage.
With Cosmos DB for MongoDB API (Version 3.4), the following find() query combined with the cursor sort() method seems to behave incorrectly:
db.test.find({"field1": "value1"}).sort({"field2": 1})
The error occurs, if all of the following conditions are met:
The default indexing policy was discarded, regardless of whether custom indexes were created afterwards using createIndex().
The find() query does not return any documents (Find(filter).Count() == 0).
The sort document defining the sort order contains only one field. It doesn't matter whether this field exists or has been indexed. Using two fields in the sort document returns 0 hits, which is the correct behavior.
The error also occurs, if all of the following conditions are met:
The default indexing policy was discarded.
The find() query returns one or more documents.
The sort document contains exactly one field, and this field has not been indexed.
The error message:
The index path corresponding to the specified order-by item is excluded.
The malfunction occurs only with Cosmos DB; with native MongoDB (MongoDB Atlas, v4.0) the query behaves correctly.
Azure Cosmos DB for MongoDB API with MongoDB 3.4 wire protocol (preview feature) is used. The problem occurs with both a MongoDB C#/.NET driver and the mongo shell.
In addition, the problem only occurs with find(). An equivalent aggregation pipeline containing $match and $sort behaves correctly.
Reproduction
Create an Azure Cosmos DB Account with the "Azure Cosmos DB for MongoDB API". Enable the preview feature MongoDB 3.4 (Version 3.2 has not been tested).
Create a new database
Create a new collection, define a shard key
Drop the default indexing policy (using db.test.dropIndexes() )
(Optional) Create new custom indexes
(Optional) Insert documents
Execute the command in the mongo shell (or the equivalent code with the MongoDB C#/.NET driver):
db.test.find({"field1": "value1"}).sort({"field2": 1})
Expected result
All documents that match the query criteria. If there are none, no documents should be returned.
Actual result
Error: error: {
"_t" : "OKMongoResponse",
"ok" : 0,
"code" : 2,
"errmsg" : "Message: {\"Errors\":[\"The index path corresponding to the specified order-by item is excluded.\"]}\r\nActivityId: c50cc751-0000-0000-0000-000000000000, Request URI: /apps/[...]/, RequestStats: \r\nRequestStartTime: 2019-07-11T08:58:48.9880813Z, RequestEndTime: 2019-07-11T08:58:49.0081101Z, Number of regions attempted: 1\r\nResponseTime: 2019-07-11T08:58:49.0081101Z, StoreResult: StorePhysicalAddress: rntbd://[...]/, LSN: 359549, GlobalCommittedLsn: 359548, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 400, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#359549, UsingLocalLSN: True, TransportException: null, ResourceType: Document, OperationType: Query\r\n, SDK: Microsoft.Azure.Documents.Common/2.4.0.0", [...]
Workaround
Adding an additional "dummy" field to the sort document prevents the error:
db.test.find({"field1": "value1"}).sort({"field2": 1, "dummyfield": 1}).count()
The workaround is not satisfactory. It could falsify the result.
Am I doing something wrong, or is Cosmos DB misbehaving here?
According to Microsoft support, an index needs to be created on the field being sorted. The default indexes can be dropped and custom indexes created. As for avoiding an index change every time a new field is added, there is no alternative other than a client-side sort. Unfortunately, client-side sorting costs a lot of CPU and memory on the client, while sorting on an index means extra work whenever more fields need to be indexed.
Thus I did not find a really satisfying solution. The options are:
Use the default indexing policy. However, this can lead to a huge index.
Index all elements that need to be sorted. Every time a new element has to be indexed, the indexing policy must be modified manually.
Use only client-side sorting. In my opinion this severely limits MongoDB's functionality.
Use the aggregation framework instead of the find method. This leads to increased complexity and traffic.
Migrate to native MongoDB.
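For reference, the client-side sort option can be sketched in Python as follows. This is a minimal illustration, not a driver feature: the function name sort_client_side is made up, and the documents are plain dicts standing in for what a driver would return for find({"field1": "value1"}). Documents without the sort field are placed first in ascending order, which is intended to mimic how MongoDB orders missing values.

```python
def sort_client_side(docs, field, descending=False):
    """Sort already-fetched documents in memory by a single field.

    Hypothetical helper. Documents where the field is missing or None
    sort before all others in ascending order.
    """
    def sort_key(doc):
        value = doc.get(field)
        # The boolean keeps None values from being compared with real values.
        return (value is not None, value)

    return sorted(docs, key=sort_key, reverse=descending)


# Stand-in for the result of find({"field1": "value1"}) without .sort().
docs = [
    {"field1": "value1", "field2": 3},
    {"field1": "value1"},
    {"field1": "value1", "field2": 1},
]
for doc in sort_client_side(docs, "field2"):
    print(doc)
```

The whole result set has to be held in memory for this, which is exactly the CPU and memory cost mentioned above.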
db.collection.createIndex({ "$**" : 1 });
I am using the Python client SDK for Datastore (google-cloud-datastore) version 1.4.0. I am trying to run a keys-only query:
query = client.query(kind='SomeEntity')
query.keys_only()
The query has an EQUAL filter on field1 and a GREATER_THAN_OR_EQUAL filter on field2. Results are ordered by field2.
For fetch, I am specifying a limit:
query_iter = query.fetch(start_cursor=cursor, limit=100)
page = next(query_iter.pages)
keyList = [entity.key for entity in page]
nextCursor = query_iter.next_page_token
Though there are around 50 entities satisfying this query, each fetch returns only around 10-15 results and a cursor. I can use the cursor to get all the results, but this adds call overhead.
Is this behavior expected?
A keys_only query is limited to 1000 entries in a single call. The operation counts as a single entity read.
For other Datastore limits, please refer to the detailed table in the documentation.
However, in your code you specified a cursor as the starting point for a subsequent retrieval operation. A query can be limited without a cursor:
query = client.query()
query.keys_only()
tasks = list(query.fetch(limit=100))
For detailed instructions on how to use limits and cursors, please refer to the Google Cloud Datastore documentation.
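If a single fetch keeps returning fewer results than the limit, the follow-up calls can at least be hidden behind a small loop. The sketch below is an illustration only: the helper name collect_keys is made up, it uses just the calls that appear in the question (fetch with start_cursor and limit, pages, next_page_token), and it assumes next_page_token is empty once the query has nothing more to return.

```python
def collect_keys(query, limit=100):
    """Follow cursors until `limit` keys are collected or the query ends.

    Hypothetical helper built on the calls shown in the question.
    Returns the keys and the last cursor (None when exhausted).
    """
    keys = []
    cursor = None
    while len(keys) < limit:
        query_iter = query.fetch(start_cursor=cursor, limit=limit - len(keys))
        page = next(query_iter.pages)
        keys.extend(entity.key for entity in page)
        cursor = query_iter.next_page_token
        if not cursor:
            # Assumption: an empty token means there are no more results.
            break
    return keys, cursor
```

Usage would be along the lines of keys, cursor = collect_keys(query, limit=100), with query built as in the question. This does not reduce the number of calls made; it only keeps the paging logic in one place.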
I have a DynamoDB table with the following attributes:
id (Number - primary key )
title (String)
created_at (Number - long)
tags (StringSet - contains a set of tags say android, ios, etc.,)
I want to be able to query by tags - get me all the items tagged android. How can I do that in DynamoDB? It appears that global secondary index can be built only on ScalarDataTypes (which is Number and String) and not on items inside a set.
If the approach I am taking is wrong, an alternative way for doing it either by creating different tables or changing the attributes is also fine.
DynamoDB is not designed to optimize indexing on set values. Below is the relevant excerpt from Amazon's documentation (from Improving Data Access with Secondary Indexes in DynamoDB).
The key schema for the index. Every attribute in the index key schema
must be a top-level attribute of type String, Number, or Binary.
Nested attributes and multi-valued sets are not allowed. Other
requirements for the key schema depend on the type of index: For a
global secondary index, the hash attribute can be any scalar table
attribute. A range attribute is optional, and it too can be any scalar
table attribute. For a local secondary index, the hash attribute must
be the same as the table's hash attribute, and the range attribute
must be a non-key table attribute.
Amazon recommends creating a separate one-to-many table for this kind of problem. More info here: Use one to many tables
This is a really old post, sorry to revive it, but I'd take a look at "Single Table Design"
Basically, stop thinking about your data as structured data and embrace denormalization.
id (Number - primary key )
title (String)
created_at (Number - long)
tags (StringSet - contains a set of tags say android, ios, etc.,)
Instead of a NoSQL table with a "header" like this:
id|title|created_at|tags
think of it like this:
pk|sk    |data....
id|id    |{title, created_at}
id|id+tag|{id, tag} <- create one record per tag
You can still return everything by querying for pk = id and sk begins with id, then joining the tags to the id records in your app logic.
You can also use a GSI to project id|id+tag into tag|id. That still requires two queries to get the items for a given tag (get the ids, then get the items), but you won't have to duplicate your data, you won't have to scan, and you'll still be able to get your items in one query when your access pattern doesn't rely on tags.
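To make the layout concrete, here is a small Python sketch that expands one item from the question into the records described above: one main record plus one record per tag. The function name and the "#" separator inside the sort key are assumptions for illustration; the answer only says "id+tag".

```python
def to_single_table_records(item):
    """Expand one item into a main record plus one record per tag.

    Hypothetical helper. The '#' separator in sk is an assumption.
    """
    item_id = str(item["id"])
    records = [{
        "pk": item_id,
        "sk": item_id,
        "title": item["title"],
        "created_at": item["created_at"],
    }]
    # Sorted only to make the output order deterministic.
    for tag in sorted(item.get("tags", ())):
        records.append({
            "pk": item_id,
            "sk": f"{item_id}#{tag}",
            "id": item_id,
            "tag": tag,
        })
    return records


item = {"id": 42, "title": "Hello", "created_at": 1562835528,
        "tags": {"android", "ios"}}
records = to_single_table_records(item)
for record in records:
    print(record)
```

A GSI keyed on tag (hash) and id (range) over the per-tag records would then serve the "all items tagged android" lookup.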
FWIW I'd start by thinking about all of your access patterns, and from there think about how you can structure composite keys and/or GSIs.
cheers
You will need to create a separate table for this query.
If you are interested in fetching all items based on a tag then I suggest keeping a table with a primary key:
hash: tag
range: id
This way you can use a very simple Query to fetch all items by tag.
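As a rough sketch, the request for such a Query could be built like this in Python. The table name "item-tags" and the helper name are hypothetical; the dictionary uses the low-level DynamoDB Query parameters, with a "#t" placeholder for the attribute name so it cannot clash with a reserved word. The actual call is left commented out because it needs boto3 and AWS credentials.

```python
def build_tag_query(table_name, tag):
    """Build low-level Query parameters for all items with a given tag.

    Hypothetical helper for a table keyed on hash=tag, range=id.
    """
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#t = :tag",
        "ExpressionAttributeNames": {"#t": "tag"},
        "ExpressionAttributeValues": {":tag": {"S": tag}},
    }


params = build_tag_query("item-tags", "android")
print(params)

# With boto3 installed and credentials configured, this could be run as:
# import boto3
# response = boto3.client("dynamodb").query(**params)
# ids = [row["id"]["N"] for row in response["Items"]]
```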