I'm doing some early trials on Cosmos, and populated a table with a set of DTOs. While some simple WHERE queries seem to return quite quickly, others are horribly inefficient. A simple COUNT(1) from c took several seconds and used over 10K request units. Even worse, a small experiment with ordering was also very discouraging. Here's my query:
SELECT TOP 20 c.Foo, c.Location from c
ORDER BY c.Location.Position.Latitude DESC
My collection (if the count was correct, I got super weird results running it while populating the DB, but that's another issue) contains about 300K DTOs. The above query ran for about 30 seconds (I currently have the DB configured to perform with 4K RU/s), and ate 87453.439 RUs with 6 round trips. Obviously, that's a no-go.
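As a rough sanity check on those numbers (a back-of-the-envelope sketch in Python, assuming the query is simply capped at the provisioned rate and nothing else competes for throughput), the request charge alone accounts for most of the observed runtime:

```python
# Figures taken from the question above. The "capped at the provisioned
# rate" model is an assumption, not something Cosmos DB reports.
request_charge_ru = 87453.439   # total RUs consumed by the ORDER BY query
provisioned_ru_per_s = 4000     # collection throughput (4K RU/s)


def min_seconds_at_rate(total_ru: float, ru_per_s: float) -> float:
    """Lower bound on wall-clock time if consumption is capped at ru_per_s."""
    return total_ru / ru_per_s


min_seconds = min_seconds_at_rate(request_charge_ru, provisioned_ru_per_s)
print(f"at least {min_seconds:.1f} s")
```

That lower bound is roughly 22 seconds, so the 30-second runtime looks like a consequence of the RU charge rather than a separate problem.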
According to the documentation, the numeric Latitude property should be indexed, so I'm not sure whether it's me screwing up here or whether reality hasn't caught up with the marketing yet ;)
Any idea on why this doesn't perform properly? Thanks for your advice!
Here's a document as returned:
{
"Id": "y-139",
"Location": {
"Position": {
"Latitude": 47.3796977,
"Longitude": 8.523499
},
"Name": "Restaurant Eichhörnli",
"Details": "Nietengasse 16, 8004 Zürich, Switzerland"
},
"TimeWindow": {
"ReferenceTime": "2017-07-01T15:00:00",
"ReferenceTimeUtc": "2017-07-01T15:00:00+02:00",
"Direction": 0,
"Minutes": 45
}
}
The DB/collection I use is just the default one that can be created for the ToDo application from within the Azure portal. This apparently created the following indexing policy:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Hash",
"dataType": "String",
"precision": 3
}
]
}
],
"excludedPaths": []
}
Update as of Dec 2017:
I revisited my unchanged database and ran the same query again. This time, it's fast and instead of > 87000 RUs, it eats up around 6 RUs. Bottom line: It appears there was something very, very wrong with Cosmos DB, but whatever it was, it seems to be gone.
I'm using Firebase Firestore through the REST API to fetch documents 5 at a time, ordered by a field called LikesCount.
When I want to fetch the next 5 documents, I have to use startAt and pass the LikesCount value of the last document from the first 5 documents.
But this fetches the wrong data when another document has the same LikesCount value. I searched a lot for a way to pass the last document's ID in addition to the LikesCount value, but nothing I tried worked. I also tested the pagination using the Web SDK, and it worked correctly because you can simply pass the document snapshot. So what does the document snapshot object include? Knowing that would help us understand the structure of the cursor and apply it to the REST API.
I tried the following request to pass the document ID as a referenceValue:
{
"structuredQuery": {
"from": [{
"collectionId": "Users"
}],
"where": {
"compositeFilter": {
"op": "AND",
"filters": []
}
},
"orderBy": [{
"field": {
"fieldPath": "LikesCount"
},
"direction": "DESCENDING"
}],
"startAt":
{ "values": [
{
"integerValue": "6"
},
{
"referenceValue": "projects/myprojectid/databases/(default)/documents/Posts/xEvmJ1LLHwTKVREQfXtX"
}
],
"before": false
},
"limit":5
}
}
But this produces an error: Cursor has too many values.
I also tried passing only the referenceValue, and it still did not return the correct 5 documents.
Thanks in advance :)
Your orderBy() has 1 field (LikesCount) but your startAt() has 2 fields. I suspect that is the reason for the error message?
Passing the integerValue won't work. If there are 13 results with the value 6, then each time you make the above call you'd get the same first 5 results.
When you say:
I tried only passing the referenceValue and still did not get the correct 5 documents
what documents are you getting? What documents were you expecting to get?
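One way to make the cursor unambiguous, sketched here in Python as a plain request-body builder (no HTTP call is made), is to add `__name__` as a second orderBy field so that the cursor can carry two values. This assumes the REST API accepts `__name__` as an ordering field together with a `referenceValue` cursor value, which is my understanding of how the SDKs' snapshot cursors behave. The helper name `build_page_query`, the project id, and `LAST_DOC_ID` are placeholders.

```python
import json


def build_page_query(collection_id, page_size, last_likes=None, last_doc_path=None):
    """Build a runQuery body ordered by LikesCount, then by document name."""
    query = {
        "from": [{"collectionId": collection_id}],
        "orderBy": [
            {"field": {"fieldPath": "LikesCount"}, "direction": "DESCENDING"},
            # Tie-breaker: document names are unique, so equal LikesCount
            # values no longer make the cursor position ambiguous.
            {"field": {"fieldPath": "__name__"}, "direction": "DESCENDING"},
        ],
        "limit": page_size,
    }
    if last_likes is not None and last_doc_path is not None:
        query["startAt"] = {
            # One cursor value per orderBy field, in the same order.
            "values": [
                {"integerValue": str(last_likes)},
                {"referenceValue": last_doc_path},
            ],
            # before=False positions the cursor just after the given values.
            "before": False,
        }
    return {"structuredQuery": query}


body = build_page_query(
    "Users",
    5,
    last_likes=6,
    last_doc_path="projects/myprojectid/databases/(default)/documents/Users/LAST_DOC_ID",
)
print(json.dumps(body, indent=2))
```

Note that the reference should point at a document in the collection being queried; the request in the question queries Users but passes a path under Posts.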
Spatial indexing does not seem to be working on a collection which contains a document with GeoJSON coordinates. I've tried using the default indexing policy, which inherently provides spatial indexing on all fields.
I've tried creating a new Cosmos DB account, database, and collection from scratch, without any success in getting the spatial indexing to work with an ST_DISTANCE query.
I've set up a simple collection with the following indexing policy:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/\"location\"/?",
"indexes": [
{
"kind": "Spatial",
"dataType": "Point"
},
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
}
],
"excludedPaths": [
{
"path": "/*"
},
{
"path": "/\"_etag\"/?"
}
]
}
The document that I've inserted into the collection:
{
"id": "document1",
"type": "Type1",
"location": {
"type": "Point",
"coordinates": [
-50,
50
]
},
"name": "TestObject"
}
The following query should return the single document in the collection:
SELECT * FROM f WHERE f.type = "Type1" and ST_DISTANCE(f.location, {'type': 'Point', 'coordinates':[-50,50]}) < 200000
But it does not return any results. If I explicitly query without using the spatial index, like so:
SELECT * FROM f WHERE f.type = "Type1" and ST_DISTANCE({'type': 'Point', 'coordinates':[f.location.coordinates[0],f.location.coordinates[1]]}, {'type': 'Point', 'coordinates':[-50,50]}) < 200000
it returns the document as it should, but it doesn't take advantage of the indexing, which I will need because I will be storing a lot of coordinates.
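As a cross-check of what the distance filter should evaluate to, here is a small Python sketch of a great-circle distance between two GeoJSON points. It is only meant for sanity checking: it assumes a spherical Earth with a mean radius of 6,371 km, whereas ST_DISTANCE may use a different model, and `haversine_m` is a name made up for this example. GeoJSON coordinates are ordered [longitude, latitude].

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(p1, p2):
    """Great-circle distance in metres between two GeoJSON Point dicts."""
    lon1, lat1 = p1["coordinates"]  # GeoJSON order: [longitude, latitude]
    lon2, lat2 = p2["coordinates"]
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


stored = {"type": "Point", "coordinates": [-50, 50]}   # document1.location
target = {"type": "Point", "coordinates": [-50, 50]}   # point used in the query
print(haversine_m(stored, target) < 200000)
```

The two points are identical, so the distance is zero and the filter should match, which is consistent with the second query returning the document.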
This seems to be the same issue referenced here. If I add a second document far away and change the '<' to '>' in the first query it works!
I should mention this is only occurring on Azure. When I use the Azure Cosmos DB Emulator it works perfectly! What is going on here?! Any tips or suggestions are much appreciated.
UPDATE: I found out why the query works on the Emulator and not on Azure: the database on the emulator doesn't have provisioned throughput shared among its collections, while I created the database in Azure with shared provisioned throughput to keep costs down (i.e. 4 collections sharing 400 RU/s). I created a database without shared provisioned throughput in Azure, and the query works with spatial indexing! I will log this issue with Microsoft to find out why this is the case.
Thanks for following up with the additional details about a fixed collection being the solution, but I wanted to get some more information.
The Cosmos DB Emulator now supports containers:
By default, you can create up to 25 fixed size containers (only supported using Azure Cosmos DB SDKs), or 5 unlimited containers using the Azure Cosmos Emulator. By modifying the PartitionCount value, you can create up to 250 fixed size containers or 50 unlimited containers, or any combination of the two that does not exceed 250 fixed size containers (where one unlimited container = 5 fixed size containers). However it's not recommended to set up the emulator to run with more than 200 fixed size containers. Because of the overhead that it adds to the disk IO operations, which result in unpredictable timeouts when using the endpoint APIs.
So, I want to see which version of the Emulator you were using. Current version is azure-cosmosdb-emulator-2.2.2.
I have a timestamp column in DocDB that I would like to query in an Azure Data Factory copy pipeline, which copies DocDB to Azure Data Lake.
I would like to run:
select * from c
where c._ts > '@{pipeline().parameters.windowStart}'
But I got:
Errors":["An invalid query has been specified with filters against path(s) that are not range-indexed.
In the DocDB indexing policy, I have:
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Hash",
"dataType": "String",
"precision": 3
}
]
}
]
I think this should allow _ts int64 to be queried by range.
Where did I go wrong?
Thanks.
I reproduced your issue with your SQL and your indexing policy.
Based on my observation, it seems that the filter value is treated as a String, not an Int. You could remove the ' characters in your SQL and try again; it works for me.
SQL:
select * from c
where c._ts > @{pipeline().parameters.windowStart}
Thanks, @Jay.
I ended up using a UDF
// Convert a date-time string to Unix epoch seconds (the unit used by _ts).
function dateTime2Epoch(dateTimeString){
return Math.trunc(new Date(dateTimeString).getTime()/1000);
}
in Cosmos DB.
Then, in the Azure Data Factory query:
select * from c
where c._ts >= udf.dateTime2Epoch('@{pipeline().parameters.windowStart}')
and c._ts < udf.dateTime2Epoch('@{pipeline().parameters.windowEnd}')
However, the query seems to be very slow. I will update this when I find out more.
Update: I ended up copying the whole thing.
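An alternative to the UDF, sketched here in Python under the assumption that the window boundaries can be converted before the query text is built (for example by whatever triggers the pipeline), is to compute the epoch seconds up front so that the filter compares `_ts` with plain numbers. As far as I know, a UDF call in a WHERE clause is evaluated per document rather than through the index, which would fit the slowness described above. The function names and the choice to treat strings without an offset as UTC are assumptions of this sketch.

```python
from datetime import datetime, timezone


def date_time_to_epoch(date_time_string: str) -> int:
    """Convert an ISO 8601 date-time string to whole Unix epoch seconds."""
    dt = datetime.fromisoformat(date_time_string)
    if dt.tzinfo is None:
        # Assumption: strings without an offset are meant as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def build_window_query(window_start: str, window_end: str) -> str:
    """Build the copy query with numeric literals instead of UDF calls."""
    start = date_time_to_epoch(window_start)
    end = date_time_to_epoch(window_end)
    return f"select * from c where c._ts >= {start} and c._ts < {end}"


print(build_window_query("1970-01-01T00:00:10", "1970-01-01T00:01:00"))
```

With numeric literals on the right-hand side, the comparison is number against number, which is the case the Range index on Number in the policy above is meant to serve.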
I'm using JMSSerializer and FOSRestBundle. I have a fairly typical object graph, including some recursion.
What I would like is for included objects (beyond a certain depth, or in general) to be listed with only their ID, but to be serialized with all their data when they are requested directly.
So, for example:
Users => Groups => Users
When requesting /user/1, the result should be something like:
{ "id": 1, "name": "John Doe", "groups": [ { "id": 10 }, { "id": 11 } ] }
While when I request /group/10 it would be:
{ "id": 10, "name": "Groupies", "users": [ { "id": 1 }, { "id": 2 }, { "id": 4 } ] }
With @MaxDepth I can hide the included arrays completely, so I get
{ "id": 1, "name": "John Doe", "groups": [] }
But I would like to include just the IDs so that the REST client can fetch them if it needs them, or consult its cache, or do whatever.
I know I can manually cobble this together using groups, but for consistency reasons I was wondering if I can somehow enable this behaviour in my entire application, maybe even with a reference to maxdepth so I can control where to include IDs and where to include full objects?
For the sake of those finding this:
I found no other solution, but doing this with groups works just fine and gives me the result I was looking for.
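To make the target output concrete, here is a language-neutral sketch in Python (not JMSSerializer or PHP code) of the shape described in the question: the requested object is serialized in full, and related objects are reduced to their IDs. The dict-based object graph, the `serialize` helper, and the second group's name are hypothetical; in JMSSerializer the same effect was achieved with groups, as noted above.

```python
def serialize(obj, depth=0, max_depth=0):
    """Serialize obj in full up to max_depth; deeper objects become {'id': ...}.

    Assumes related objects are dicts held in lists, as in the example graph.
    """
    if depth > max_depth:
        return {"id": obj["id"]}
    out = {}
    for key, value in obj.items():
        if isinstance(value, list):
            out[key] = [serialize(item, depth + 1, max_depth) for item in value]
        else:
            out[key] = value
    return out


# Recursive graph from the question: Users => Groups => Users
user = {"id": 1, "name": "John Doe", "groups": []}
group10 = {"id": 10, "name": "Groupies", "users": [user]}
group11 = {"id": 11, "name": "Other group", "users": [user]}  # placeholder name
user["groups"] = [group10, group11]

print(serialize(user))
```

Because anything past `max_depth` is cut down to its ID, the Users => Groups => Users cycle cannot cause infinite recursion.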
I'm trying to learn to use Freebase, however when I try and do a sort by "/people/person/date_of_birth" for a search for actors for a show, it returns:
"code": 400,
"message": "Must sort on a single value, not at /tv/tv_program/regular_cast./tv/regular_tv_appearance/actor./people/person/date_of_birth"
Here is the full MQL query:
[{
"id": "/m/0524b41",
"name": [],
"sort":"/tv/tv_program/regular_cast./tv/regular_tv_appearance/actor./people/person/date_of_birth",
"/tv/tv_program/regular_cast": [{
"/tv/regular_tv_appearance/actor": [{
"name": [],
"/people/person/date_of_birth": []
}]
}]
}]
But you're not asking to sort on /people/person/date_of_birth, you're asking to sort on that long nested expression which goes through multiple intermediary nodes, some of which can appear multiple times (as indicated by the [] array notation). It's this multiplicity that MQL is complaining about.
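The multiplicity problem can be seen with plain data structures. In this Python sketch (an illustration only: the nested dict merely mimics the shape of the query and is not real MQL output, and the two birth years are the ones quoted later in this answer), the show has many actors, so there is no single birth date to sort the show by, whereas each actor has exactly one:

```python
# Hypothetical stand-in for the nested result shape of the original query.
show = {
    "id": "/m/0524b41",
    "regular_cast": [
        {"actor": {"name": "Lino Facioli", "date_of_birth": "2000"}},
        {"actor": {"name": "Peter Vaughan", "date_of_birth": "1923"}},
    ],
}

# Seen from the show, the sort key is a list of values, not a single value.
birth_dates_seen_from_show = [
    cast["actor"]["date_of_birth"] for cast in show["regular_cast"]
]

# Seen from each actor (the "inverted" view), the sort key is a single value.
actors = [cast["actor"] for cast in show["regular_cast"]]
actors_by_birth = sorted(actors, key=lambda actor: actor["date_of_birth"])
print([actor["name"] for actor in actors_by_birth])
```

This is the same idea as inverting the query: make the thing that owns the single-valued sort key the top-level object.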
To fix it, take your query, paste it into the query editor, click on the innermost /person/date_of_birth and then click "Invert Query." That will turn the query inside out and give you something that looks like this:
[{
"name": [],
"/people/person/date_of_birth": [],
"type": "/tv/tv_actor",
"!/tv/regular_tv_appearance/actor": [{
"!/tv/tv_program/regular_cast": [{
"id": "/m/0524b41",
"name": [],
"sort": "/tv/tv_program/regular_cast./tv/regular_tv_appearance/actor./people/person/date_of_birth"
}]
}]
}]
which isn't exactly what you want, but indicates the general shape of your target query.
Getting rid of the array brackets for single valued properties and moving the sort clause to the outside gives us:
[{
"name": null,
"/people/person/date_of_birth": null,
"sort": "/people/person/date_of_birth",
"type": "/tv/tv_actor",
"!/tv/regular_tv_appearance/actor": [{
"!/tv/tv_program/regular_cast": [{
"id": "/m/0524b41",
"name": null
}]
}]
}]
which is functional and returns our 81 regular Game of Thrones actors sorted by birth date, but could still be cleaned up a bit more. The !inverse property notation isn't necessary since we have forward equivalents and we don't really need to get the Game of Thrones info over and over again since it's constant and we really just want to use it as a filter.
Incorporating these final tweaks gives us a final query like this which returns nice compact results:
[{
"name": null,
"/people/person/date_of_birth": null,
"sort": "/people/person/date_of_birth",
"type": "/tv/tv_actor",
"starring_roles": [{
"series": {
"id": "/m/0524b41"
},
"limit": 0
}]
}]
The "limit": 0 clause is a little trick to cause MQL to use that subquery for filtering, but not bother returning any of the (constant) information in the results. The /tv/tv_actor/starring_roles and /tv/regular_tv_appearance/series can be abbreviated to the simple property names because their types are implied by their context.
Since there are only 81 results, MQL's default limit of 100 is plenty and we don't need to worry about increasing it or using cursors.
Oldest Game of Thrones actor: Peter Vaughan, born 1923.
Youngest: Lino Facioli, born 2000.
Note that 7 actors don't have birth dates in Freebase, so we don't know where they rank age-wise. Here's a bonus query which returns their names and ids as well as their character's name. If we were running a production system, we might use something like this to feed a human curation queue to fill in the gaps.
[{
"name": null,
"/people/person/date_of_birth": {
"value": null,
"optional": "forbidden"
},
"type": "/tv/tv_actor",
"starring_roles": [{
"series": {
"id": "/m/0524b41"
},
"character":null
}]
}]
The seven character/actor pairs are (were):
Roose Bolton - Michael McElhatton
Gregor Clegane - Conan Stevens
Hizdahr zo Loraq - Joel Fry
Rickon Stark - Art Parkinson
Janos Slynt - Dominic Carter
Hodor - Kristian Nairn
Tommen Baratheon - Callum Wharry
I say "were" because I couldn't resist fixing Hodor's birth date. The strange thing is that it was in Wikipedia, so should have been picked up automatically by Freebase. I think there's a bug lurking there somewhere.