Top N per Classification in CosmosDB - azure-cosmosdb

I'm kinda stuck on this issue. I have several hundred documents of a certain model stored in Cosmos DB and I can't seem to get the top 5 of each category.
This is the model:
"id": "06224840-6b88-4394-9324-4d1628383702",
"name": "Reservation",
"description": null,
"client": null,
"reference": null,
"isMonitoring": false,
"monitoringSince": null,
"hasRiskProfile": false,
"riskProfile": -1,
"monitorFrequency": 0,
"mainBindable": null,
"organizationId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"userId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"createDate": "2020-08-18T11:00:02.5266403Z",
"updateDate": "2020-08-18T11:00:02.5266419Z",
"lastMonitorDate": "2020-08-18T11:00:02.5266427Z"
So what I'm trying to do is use C# to get the top 5 from each risk profile where the organizationId matches. GroupBy through LINQ throws an error, and a ROW_NUMBER() query combined with PARTITION BY doesn't seem to work either.
Is there any way I can get this to work in a single query that Cosmos supports?
EDIT:
What I am trying to achieve in Cosmos DB is roughly this:
WITH TopEntries AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY [riskProfile]
            ORDER BY [updateDate] DESC
        ) AS [ROW NUMBER]
    FROM [reservations]
    WHERE [organizationId] = 'xyz'
)
SELECT * FROM TopEntries
WHERE TopEntries.[ROW NUMBER] <= 5

It sounds like combining TOP and ORDER BY would do the job. For example:
SELECT TOP 5 *
FROM c
WHERE c.organizationId = "xyz"
ORDER BY c.riskProfile
You can build such queries with parameters in the .NET SDK as in this sample.
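As a rough sketch of the parameterized version with the .NET SDK v3 (assuming an initialized Container named container, a Reservation POCO matching the model above, and that this runs inside an async method):
var query = new QueryDefinition(
    "SELECT TOP 5 * FROM c WHERE c.organizationId = @orgId ORDER BY c.riskProfile")
    .WithParameter("@orgId", "xyz");

// Reservation is assumed to be a POCO matching the document shown in the question.
FeedIterator<Reservation> iterator = container.GetItemQueryIterator<Reservation>(query);
while (iterator.HasMoreResults)
{
    foreach (Reservation reservation in await iterator.ReadNextAsync())
    {
        // process the reservation
    }
}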

The functionality you are trying to achieve is not directly possible with a single query in Cosmos DB. There are two steps to do this (adapt them to your own document set).
First, group by the category, like below:
SELECT c.city FROM c WHERE c.org = 'xyz' GROUP BY c.city
Then loop through the results of the first query one by one, like below:
SELECT TOP 5 * FROM c WHERE c.city = 'delhi' ORDER BY c.date DESC
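Mapped back onto the reservations model, a hedged C# sketch of the two steps (assumes the .NET SDK v3, an initialized Container named container, a Reservation POCO, and uses DISTINCT in place of the GROUP BY step; run inside an async method):
// Step 1: get the distinct risk profiles for the organization.
var profileQuery = new QueryDefinition(
    "SELECT DISTINCT VALUE c.riskProfile FROM c WHERE c.organizationId = @orgId")
    .WithParameter("@orgId", "xyz");

var riskProfiles = new List<int>();
FeedIterator<int> profileIterator = container.GetItemQueryIterator<int>(profileQuery);
while (profileIterator.HasMoreResults)
{
    riskProfiles.AddRange(await profileIterator.ReadNextAsync());
}

// Step 2: one TOP 5 query per risk profile.
var topPerProfile = new Dictionary<int, List<Reservation>>();
foreach (int profile in riskProfiles)
{
    var topQuery = new QueryDefinition(
        "SELECT TOP 5 * FROM c WHERE c.organizationId = @orgId AND c.riskProfile = @profile ORDER BY c.updateDate DESC")
        .WithParameter("@orgId", "xyz")
        .WithParameter("@profile", profile);

    var top5 = new List<Reservation>();
    FeedIterator<Reservation> topIterator = container.GetItemQueryIterator<Reservation>(topQuery);
    while (topIterator.HasMoreResults)
    {
        top5.AddRange(await topIterator.ReadNextAsync());
    }
    topPerProfile[profile] = top5;
}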
You can refer to similar issue here:
https://learn.microsoft.com/en-us/answers/questions/38454/index.html

Related

Cosmos DB query on key-value pairs

I have a large collection of JSON documents whose structure is of the form:
{
    "id": "00000000-0000-0000-0000-000000001122",
    "typeId": 0,
    "projectId": "p001",
    "properties": [
        {
            "id": "a6fdd321-562c-4a40-97c7-4a34c097033d",
            "name": "projectName",
            "value": "contoso"
        },
        {
            "id": "d3b5d3b6-66de-47b5-894b-cdecfc8afc40",
            "name": "status",
            "value": "open"
        },
        .....{etc}
    ]
}
There may be a lot of properties in the collection, all identified by the value of name. The fields in properties are pretty consistent -- there may be some variability, but they will all have the fields that I care about. There's an Id, some labels, etc
I want to combine these with some other data in PowerBI, using the projectId, to create some very valuable reports.
I think what I want to do is 'normalize' this data into a table, like:
ProjectId | projectName | status | openDate | closeDate | manager
p001      | contoso     | open   | 20200101 |           | me
etc
Where I'm at...
I can go:
SELECT c["value"] AS ProjectName
FROM c in t.Properties
WHERE c["name"] = "projectName"
... this will give me each projectName
I can do that a heap of times to get the 'values' (status, openDate, manager, etc)
If I want to combine them then I would need to join all those sub-queries together on 'id'. But 'id' is not in the scope of the SELECT, so how do I get it? If I were to do this, it also sounds like something that would be very expensive (in RUs) to execute.
I think I'm overcomplicating this, but I can't quite get my head around the Cosmos syntax.
Help?
You can achieve it with JOINs and WHERE expressions, although the schema is not ideal for querying and you should consider changing it.
SELECT
c['projectId'], --c.projectId also works, but value is a reserved keyword
n['value'] AS projectName,
s['value'] AS status
FROM c
JOIN n IN c.properties
JOIN s IN c.properties
WHERE n['name'] = 'projectName' AND s['name'] = 'status'
--note all filtered properties must appear exactly once for it to work properly
Edit: a new query that solves the potential issue that filtered properties must appear exactly once.
SELECT
c['projectId'],
ARRAY(
SELECT VALUE n['value']
FROM n IN c.properties
WHERE n['name'] = 'projectName'
)[0] AS projectName,
ARRAY(
SELECT VALUE n['value']
FROM n IN c.properties
WHERE n['name'] = 'status'
)[0] AS status
FROM c

Get multiple counts with one Cosmos DB query?

Consider these queries:
SELECT COUNT(1) AS failures 
FROM c 
WHERE c.time = 1623332779 AND c.status = 'FAILURE'
SELECT COUNT(1) AS successes 
FROM c 
WHERE c.time = 1623332779 AND c.status = 'SUCCESS'
How can I combine these two distinct queries into one query?
I tried repurposing the answers from How to get multiple counts with one SQL query?, but ran into a few problems:
COUNT(*) throws an error "Syntax error, incorrect syntax near '*'."
UNION throws "Syntax error, incorrect syntax near 'UNION'."
I also experimented with
SELECT 
SUM(CASE WHEN c.time = 1623332779 THEN 1 else 0 end)
FROM c
but this leads to another syntax error. I noticed that
SELECT COUNT(1) AS mycounter, COUNT(1) AS mycounter2 
FROM c
WHERE c.time = 1623332779
returns
[
{
"mycounter": 3,
"mycounter2": 3
}
]
but I was unable to link these distinct counters to distinct queries.
The following should work. COUNT skips values that are undefined, which allows you to filter rows out of the count:
SELECT
COUNT(c.status = 'SUCCESS' ? 1 : undefined) AS successes,
COUNT(c.status = 'FAILURE' ? 1 : undefined) AS failures
FROM c
WHERE c.time = 1623332779
It ruins performance though, as it doesn't use the index at all for the count, so you're better off using two separate queries. If you really want to use a single request, you could create a stored procedure that runs both queries and pastes the results together.
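For completeness, a hedged C# sketch of the two-separate-queries approach (assumes the .NET SDK v3 and an initialized Container named container; the helper name is illustrative and this runs inside an async method):
async Task<long> CountByStatusAsync(Container container, long time, string status)
{
    // SELECT VALUE returns the bare count rather than a wrapped object.
    var query = new QueryDefinition(
        "SELECT VALUE COUNT(1) FROM c WHERE c.time = @time AND c.status = @status")
        .WithParameter("@time", time)
        .WithParameter("@status", status);

    long total = 0;
    FeedIterator<long> iterator = container.GetItemQueryIterator<long>(query);
    while (iterator.HasMoreResults)
    {
        foreach (long partialCount in await iterator.ReadNextAsync())
        {
            total += partialCount;
        }
    }
    return total;
}

long successes = await CountByStatusAsync(container, 1623332779, "SUCCESS");
long failures = await CountByStatusAsync(container, 1623332779, "FAILURE");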
Instead of counting against the overall query, you can use GROUP BY to get both counts in a single query. For example:
SELECT c.time, c.status, COUNT(c.status) AS statuscount
FROM c
WHERE c.time = 1623332779
GROUP BY c.time, c.status
This won't give you explicit counts called "successes" and "failures" but it will return both counts, something like:
[
{
"time": 1623332779,
"status": "FAILURE",
"statuscount": 123
},
{
"time": 1623332779,
"status": "SUCCESS",
"statuscount": 456
}
]
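If you are calling this from the .NET SDK, one way to turn that result into named counts on the client (a sketch, with a hypothetical StatusCount DTO; assumes an initialized Container named container, inside an async method):
// Hypothetical DTO matching the projected columns of the GROUP BY query.
public class StatusCount
{
    public string status { get; set; }
    public long statuscount { get; set; }
}

var query = new QueryDefinition(
    "SELECT c.status, COUNT(c.status) AS statuscount FROM c WHERE c.time = @time GROUP BY c.status")
    .WithParameter("@time", 1623332779);

var countsByStatus = new Dictionary<string, long>();
FeedIterator<StatusCount> iterator = container.GetItemQueryIterator<StatusCount>(query);
while (iterator.HasMoreResults)
{
    foreach (StatusCount row in await iterator.ReadNextAsync())
    {
        countsByStatus[row.status] = row.statuscount;
    }
}

// countsByStatus["SUCCESS"] and countsByStatus["FAILURE"] now hold the two counts.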

Getting values from array in Cosmos Db

My document that I save in Cosmos DB looks like this:
{
"id": "abc123",
"myProperty": [
"1905844b-6ca9-4967-ba40-a736b685ca62",
"b03cc85c-ef0b-4f48-9c31-800de089190a"
]
}
As you can see, in the myProperty property, I have an array of GUID values and I want to read them as an array/list of GUID values but I'm having trouble formulating the correct SELECT statement.
The output I'm looking for is:
[
"1905844b-6ca9-4967-ba40-a736b685ca62",
"b03cc85c-ef0b-4f48-9c31-800de089190a"
]
The closest I could get is this SELECT statement:
SELECT VALUE c.myProperty FROM c WHERE c.id = "abc123"
But this doesn't give me exactly what I want either. This gives me an array within an array i.e.
[
[
"1905844b-6ca9-4967-ba40-a736b685ca62",
"b03cc85c-ef0b-4f48-9c31-800de089190a"
]
]
What should my SELECT statement look like to get what I want?
I don't think you can ever get anything else, because Cosmos DB will always return an array in response to a query: there can potentially be anywhere from zero to infinitely many results, so you will always get a top-level array wrapping all of them (even if there is only one).
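If you are querying from the .NET SDK, you can simply unwrap that outer array on the client. A minimal sketch (assumes an initialized Container named container, inside an async method):
var query = new QueryDefinition(
    "SELECT VALUE c.myProperty FROM c WHERE c.id = @id")
    .WithParameter("@id", "abc123");

// Each result row is one document's myProperty array; with a filter on id
// there is at most one row, so take the last (only) one.
List<Guid> guids = new List<Guid>();
FeedIterator<List<Guid>> iterator = container.GetItemQueryIterator<List<Guid>>(query);
while (iterator.HasMoreResults)
{
    foreach (List<Guid> row in await iterator.ReadNextAsync())
    {
        guids = row;
    }
}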

How would I return a count of all rows in the table and then the count of each time a specific status is found?

Please forgive my ignorance on sqlalchemy; up until this point I've been able to navigate the seas just fine. What I'm looking to do is this:
Return a count of how many items are in the table.
Return a count of how many times each different status appears in the table.
I'm currently using sqlalchemy, but even a pure sqlite solution would be beneficial in figuring out what I'm missing.
Here is how my table is configured:
class KbStatus(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    status = db.Column(db.String, nullable=False)
It's a very basic table but I'm having a hard time getting back the data I'm looking for. I have this working with 2 separate queries, but I have to believe there is a way to do this all in one query.
Here are the separate queries I'm running:
total = len(cls.query.all())
status_count = cls.query.with_entities(KbStatus.status, func.count(KbStatus.id).label("total")).group_by(KbStatus.status).all()
From here I'm converting it to a dict and combining it to make the output look like so:
{
"data": {
"status_count": {
"Assigned": 1,
"In Progress": 1,
"Peer Review": 1,
"Ready to Publish": 1,
"Unassigned": 4
},
"total_requests": 8
}
}
Any help is greatly appreciated.
I don't know about sqlalchemy, but it's possible to generate the results you want in a single query with pure sqlite using the JSON1 extension:
Given the following table and data:
CREATE TABLE data(id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO data(status) VALUES ('Assigned'),('In Progress'),('Peer Review'),('Ready to Publish')
,('Unassigned'),('Unassigned'),('Unassigned'),('Unassigned');
CREATE INDEX data_idx_status ON data(status);
this query
WITH individuals AS (SELECT status, count(status) AS total FROM data GROUP BY status)
SELECT json_object('data'
, json_object('status_count'
, json_group_object(status, total)
, 'total_requests'
, (SELECT sum(total) FROM individuals)))
FROM individuals;
will return one row holding (after running it through a JSON pretty printer; the actual string is more compact):
{
"data": {
"status_count": {
"Assigned": 1,
"In Progress": 1,
"Peer Review": 1,
"Ready to Publish": 1,
"Unassigned": 4
},
"total_requests": 8
}
}
If the sqlite instance you're using wasn't built with support for JSON1:
SELECT status, count(status) AS total FROM data GROUP BY status;
will give
status                total
--------------------  ----------
Assigned              1
In Progress           1
Peer Review           1
Ready to Publish      1
Unassigned            4
which you can iterate through in Python, inserting each row into your dict and adding up all the total values in another variable as you go, to get the total_requests value at the end. No need for another query just to calculate that number; do it manually. I bet it's really easy to do the same thing with your existing second sqlalchemy query.

What is the best way to design a tag-based data table with Sqlite?

The JSON received from the server has this form.
[
{
"id": 1103333,
"name": "James",
"tagA": [
"apple",
"orange",
"grape"
],
"tagB": [
"red",
"green",
"blue"
],
"tagC": null
},
{
"id": 1103336,
"name": "John",
"tagA": [
"apple",
"pinapple",
"melon"
],
"tagB": [
"black",
"white",
"blue"
],
"tagC": [
"London",
"New York"
]
}
]
An object can have multiple tags, and a tag can be associated with multiple objects.
In this list, I want to find an object whose tagA is apple or grape and tagB is black.
This is the first table design I tried:
create table response(id integer primary key, name text not null, tagA text,
tagB text, tagC text)
select * from response where (tagA like '%apple%' or tagA like '%grape%') and (tagB like '%black%')
This type of table design has a problem: the search speed is very slow, because it cannot take advantage of SQLite's FTS (full-text search) feature when using an ORM library such as Room.
The next thing I thought about was to create a table for each tag.
create table response(id integer primary key, name text not null)
create table tagA(objectID integer, value text, primary key(objectID, value))
create table tagB(objectID integer, value text, primary key(objectID, value))
create table tagC(objectID integer, value text, primary key(objectID, value))
select * from response where id in (
    select objectID from tagA where value in ('apple', 'grape')
    intersect
    select objectID from tagB where value in ('black')
)
This greatly increases the insertion time and the size of the APK (roughly twice as much per additional table), and the search speed still falls far behind that of an FTS virtual table.
I want to avoid using FTS tables as much as possible, because there would be more things I need to manage myself.
There are probably a lot of things I have missed (indexes, etc.) but I cannot figure out what they are.
How can I optimize the database without using the FTS method?
You could use a reference table (aka a mapping table, among a multitude of other names) to allow a many-to-many relationship between tags (a single table for all) and objects (again a single table).
So you have the objects table, each object having an id, and you have the tags table, again with an id for each tag. So something like :-
DROP TABLE IF EXISTS object_table;
CREATE TABLE IF NOT EXISTS object_table (id INTEGER PRIMARY KEY, object_name);
DROP TABLE IF EXISTS tag_table;
CREATE TABLE IF NOT EXISTS tag_table (id INTEGER PRIMARY KEY, tag_name);
You'd populate both e.g.
INSERT INTO object_table (object_name) VALUES
('Object1'),('Object2'),('Object3'),('Object4');
INSERT INTO tag_table (tag_name) VALUES
('Apple'),('Orange'),('Grape'),('Pineapple'),('Melon'),
('London'),('New York'),('Paris'),
('Red'),('Green'),('Blue'); -- and so on
Then you'd have the mapping table, something like :-
DROP TABLE IF EXISTS object_tag_mapping;
CREATE TABLE IF NOT EXISTS object_tag_mapping (object_reference INTEGER, tag_reference INTEGER);
Over time, as tags are assigned to objects or vice versa, you add the mappings, e.g. :-
INSERT INTO object_tag_mapping VALUES
(1,4), -- obj1 has tag Pineapple
(1,1), -- obj1 has Apple
(1,8), -- obj1 has Paris
(1,10), -- obj1 has green
(4,1),(4,3),(4,11), -- some tags for object 4
(2,8),(2,7),(2,4), -- some tags for object 2
(3,1),(3,2),(3,3),(3,4),(3,5),(3,6),(3,7),(3,8),(3,9),(3,10),(3,11); -- all tags for object 3
You could then have queries such as :-
SELECT object_name,
group_concat(tag_name,' ~ ') AS tags_for_this_object
FROM object_tag_mapping
JOIN object_table ON object_reference = object_table.id
JOIN tag_table ON tag_reference = tag_table.id
GROUP BY object_name
;
group_concat is an aggregate function (applied per GROUP) that concatenates all values found for the specified column with an (optional) separator.
The result of the query being :-
The following could be a search based upon tags (not that you'd likely use both tag_name and a tag_reference) :-
SELECT object_name, tag_name
FROM object_tag_mapping
JOIN object_table ON object_reference = object_table.id
JOIN tag_table ON tag_reference = tag_table.id
WHERE tag_name = 'Pineapple' OR tag_reference = 9
;
This would result in :-
Note this is a simple overview; e.g. you may want to consider making the mapping table a WITHOUT ROWID table, and perhaps adding a composite UNIQUE constraint.
Additional re comment :-
How do I implement a query that contains two or more tags at the same time?
This is a little more complex if you want specific tags, but still doable. Here's an example using a CTE (Common Table Expression) along with a HAVING clause (a WHERE clause applied after the output has been generated, so it can be applied to aggregates) :-
WITH cte1(otm_oref,otm_tref,tt_id,tt_name, ot_id, ot_name) AS
(
SELECT * FROM object_tag_mapping
JOIN tag_table ON tag_reference = tag_table.id
JOIN object_table ON object_reference = object_table.id
WHERE tag_name = 'Pineapple' OR tag_name = 'Apple'
)
SELECT ot_name, group_concat(tt_name), count() AS cnt FROM CTE1
GROUP BY otm_oref
HAVING cnt = 2
;
This results in :-
