What are the usual RU values you expect when you run simple queries? - azure-cosmosdb-gremlinapi

I started experimenting with Gremlin and Cosmos DB a short time ago, with the idea of building a prototype for a project at work.
How do I know if a "Request charge value" is good or bad?
For example, I have a query to get a list of flats located in a specific state which looks like this:
g.V().hasLabel('Flat').as('Flats').outE('has_property')
.inV().has('propertyType',eq('reference')).has('propertyName','City').outE('property_refers_to')
.inV().hasLabel('City').outE('has_property')
.inV().has('propertyType',eq('reference')).has('propertyName','State').outE('property_refers_to')
.inV().hasLabel('State').outE('has_property')
.inV().has('propertyType',eq('scalar')).has('name',eq('Some_State_Name'))
.values('name')
.select('Flats')
Each object-property is not stored directly in the node representing an "object" but has its own node. Some properties are "scalar" (like the name of a state or a city) and some are a reference to another node (like the city where a flat is located).
With this query, Azure shows me a "Request charge" value of 73.179.
Is it good?
What are the things I should try to improve?


Querying details from GraphDB

We are trying to implement customer-oriented details in GraphDB, where with a single query we can fetch the details of a customer such as his address, phone, email, etc. We have built it using has address, has email edges.
g.addV('member').property('id','CU10611972').property('CustomerId', 'CU10611972').property('TIN', 'xxxx').property('EntityType', 'Person').property('pk', 'pk')
g.addV('email').property('id','CU10611972E').property('pk', 'pk')
g.addV('primary').property('id','CU10611972EP').property('EmailPreference','Primary').property('EmailType', 'Home').property('EmailAddress', 'SNEHA#GMAIL.COM').property('pk', 'pk')
g.V('CU10611972').addE('has Email').to(g.V('CU10611972E'))
g.V('CU10611972E').addE('has Primary Email').to(g.V('CU10611972EP'))
This is how we have built the email relation to the customer. Similarly we have relations for Address and Phone. Right now we are using this command to fetch the JSON related to this customer for email:
g.V('CU10611972').out('has Email').out('has Primary Email')
And for complete customer details we are using a union over each vertex: Phone, Email and Address.
Could you please suggest if there is an efficient way to query this detail?
This really comes down to two things:
1. General graph data modelling.
2. Things the graph DB you are using does and does not support.
With Gremlin there are a few ways to model this data for a single vertex.
If the database supports it, have a list of names like ['home','mobile'] and use metaproperties to attach a phone number to each.
A lot of the Gremlin implementations I am aware of have chosen not to support meta properties. In these cases you have a couple of options.
(a) Have a property for 'Home' and another for 'Mobile'. If either is not known you could either not create that property or give it a value such as "unknown"
(b) Use prefixed strings such as ["Home:123456789","Mobile:123456789"] and store them in a set or list (multi properties) and access them in Gremlin using the startingWith predicate, such as g.V(id).properties('phone').hasValue(startingWith('Mobile')).value()
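To make option (b) concrete, here is a plain-Python sketch of the prefixed-string lookup that the startingWith predicate performs. The vertex dict is just a stand-in for real graph storage, not any Gremlin API:

```python
# Option (b) sketch: store a multi-valued 'phone' property as prefixed strings
# and filter by prefix, mimicking Gremlin's startingWith predicate.
vertex = {
    "label": "member",
    "phone": ["Home:123456789", "Mobile:987654321"],  # multi-property as a list
}

def phones_with_prefix(v, prefix):
    """Return the phone numbers whose stored value starts with the given prefix."""
    return [p.split(":", 1)[1] for p in v["phone"] if p.startswith(prefix + ":")]

print(phones_with_prefix(vertex, "Mobile"))  # ['987654321']
```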

DataPlaneRequests partitionID in CosmosDB diagnostic logs always seems to be empty

I have enabled diagnostic logging of a Cosmos Account (SQL interface). The diagnostic log data is being sent to a storage account - and I can see that there is a new DataPlaneRequests blob created every 5 minutes.
So far, so good.
I'm performing CRUD requests against a collection in the Cosmos account. I can see entries within the DataPlaneRequest logs that look like this ('*' used to protect the innocent)...
{
  "time": "2020-01-28T03:04:59.2606375Z",
  "resourceId": "/SUBSCRIPTIONS/****/RESOURCEGROUPS/****/PROVIDERS/MICROSOFT.DOCUMENTDB/DATABASEACCOUNTS/**********",
  "category": "DataPlaneRequests",
  "operationName": "Query",
  "properties": {
    "activityId": "38f497ee-7e37-435f-8b4a-a2f0d8d65d12",
    "requestResourceType": "DocumentFeed",
    "requestResourceId": "/dbs/****/colls/****/docs",
    "collectionRid": "",
    "databaseRid": "",
    "statusCode": "200",
    "duration": "4.588500",
    "userAgent": "Windows/10.0.14393 documentdb-netcore-sdk/2.8.1",
    "clientIpAddress": "52...***",
    "requestCharge": "4.160000",
    "requestLength": "278",
    "responseLength": "5727",
    "resourceTokenUserRid": "",
    "region": "West US 2",
    "partitionId": ""
  }
}
Every entry in the DataPlaneRequests log has an empty partitionId property value.
(The operationName property value in the log is either "Create" or "Query").
So my question is - why is this property empty?
Here is the documentation for DataPlaneRequests
What I'm actually trying to accomplish, is to obtain information about the load being placed on the physical partitions of a collection.
e.g. I'd like to know that during the past 10 minutes, 10k Create operations were performed in physical-partition "1", while 55k operations were performed in physical-partition "3".
That will allow me to have much more insight into why a collection is experiencing throttling, etc.
When you connect to Cosmos, there are two connection modes available: Gateway and Direct. It turns out that only Direct mode causes the partitionId to be included in the logs. (If you read up on how these two modes work differently, that makes sense.)
Anyway, it turns out that the partitionId in the logs is not a reference to a physical partition of a collection, so I'm unable to use that data to solve the problem I was attempting to solve.
There is a physical partition id available in the logs, but it's also of limited use, since it's only tracked for the 3 largest (logical) partition key values of each physical partition, and only if the key value contains >= 1 GB of documents.
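For what it's worth, if the partitionId field were populated (i.e. under Direct mode), aggregating load per partition from the DataPlaneRequests blobs is only a few lines of code. A sketch, assuming one JSON record per blob line shaped like the sample above; these records and their partitionId values are fabricated for illustration:

```python
import json
from collections import defaultdict

# Hypothetical DataPlaneRequests records (one JSON object per line, as in the
# sample entry above); partitionId values here are made up for illustration.
log_lines = [
    '{"operationName": "Create", "properties": {"requestCharge": "4.16", "partitionId": "1"}}',
    '{"operationName": "Query",  "properties": {"requestCharge": "2.80", "partitionId": "3"}}',
    '{"operationName": "Create", "properties": {"requestCharge": "5.00", "partitionId": "3"}}',
]

def charge_by_partition(lines):
    """Sum requestCharge per partitionId across log records."""
    totals = defaultdict(float)
    for line in lines:
        props = json.loads(line)["properties"]
        totals[props["partitionId"]] += float(props["requestCharge"])
    return dict(totals)

print(charge_by_partition(log_lines))
```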

How can I query for all new and updated documents since last query?

I need to query a collection and return all documents that are new or updated since the last query. The collection is partitioned by userId. I am looking for a value that I can use (or create and use) that would help facilitate this query. I considered using _ts:
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value]
The problem with _ts is that it is not granular enough and the query could miss updates made in the same second by another client.
In SQL Server I could accomplish this using an IDENTITY column in another table. Let's call the table version. In a transaction I would create a new row in the version table, then do the updates to the other table (including updating the version column with the new value). To query for new and updated rows I would use a query like this:
SELECT * FROM table WHERE userId=[some-user-id] and version > [some-value]
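That version-counter pattern can be sketched in Python to make the idea concrete; this is an in-memory illustration, not a Cosmos DB or SQL Server API:

```python
import itertools

# In-memory sketch of the IDENTITY/version pattern: every write stamps the
# document with the next value from a monotonic counter, so "changed since"
# queries never miss same-second writes the way _ts can.
_version_counter = itertools.count(1)
store = {}  # id -> document

def upsert(doc_id, body):
    store[doc_id] = {"id": doc_id, "body": body, "version": next(_version_counter)}

def changed_since(version):
    """All documents written after the given version, oldest first."""
    return sorted((d for d in store.values() if d["version"] > version),
                  key=lambda d: d["version"])

upsert("a", "first")
checkpoint = store["a"]["version"]   # client has seen everything up to here
upsert("b", "second")
upsert("a", "updated")               # same doc written again
print([d["id"] for d in changed_since(checkpoint)])  # ['b', 'a']
```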
How could I do something like this in Cosmos DB? The Change Feed seems like the right option, but without the ability to query the Change Feed, I'm not sure how I would go about this.
In case it matters, the (web/mobile) clients connect to data in Cosmos DB via a web api. I have control of the entire stack - from client to back-end.
As stated in this link:
Today, you see all operations in the change feed. The functionality where you can control the change feed for specific operations, such as updates only and not inserts, is not yet available. You can add a "soft marker" on the item for updates and filter based on that when processing items in the change feed. Currently the change feed doesn't log deletes. Similar to the previous example, you can add a soft marker on the items that are being deleted; for example, you can add an attribute in the item called "deleted", set it to "true", and set a TTL on the item so that it can be automatically deleted. You can read the change feed for historic items, for example, items that were added five years ago. If the item is not deleted, you can read the change feed as far back as the origin of your container.
So the change feed alone does not cover your requirements.
My idea:
Use the Azure Function Cosmos DB Trigger to collect all the operations in your specific Cosmos collection. Follow this document to configure the input of the Azure Function as Cosmos DB, then follow this document to configure the output as Azure Queue Storage.
Get the ids of changed items and send them into queue storage as messages. When you want to query the changed items, just query the messages from the queue, consume them at a specific unit of time, and after that clear the entire queue. No items will be missed.
With your approach, you can get added/updated documents and save a reference value (the _ts and id fields) somewhere (like a blob):
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value] and id !='guid' order by _ts desc
This is similar to the approach we use to read data from Event Hub and store checkpointing information (epoch number, sequence number and offset value) in a blob, where at any time only one function can take a lease on that blob.
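A minimal sketch of that "_ts plus last-seen id" checkpoint idea, with in-memory documents standing in for the Cosmos collection (the _ts values are fabricated):

```python
# Checkpoint sketch: fetch documents with _ts >= checkpoint, skip the one we
# already processed, and remember the newest (_ts, id) pair for the next poll.
# Using >= rather than > means same-second writes are not missed.
docs = [
    {"id": "d1", "userId": "u1", "_ts": 100},
    {"id": "d2", "userId": "u1", "_ts": 105},
    {"id": "d3", "userId": "u1", "_ts": 105},  # same second as d2
]

def fetch_changes(user_id, last_ts, last_id):
    """Emulates: SELECT * FROM c WHERE c.userId = @u AND c._ts >= @ts
       AND c.id != @id ORDER BY c._ts. The id filter avoids re-reading
       the checkpointed document itself."""
    return sorted((d for d in docs
                   if d["userId"] == user_id
                   and d["_ts"] >= last_ts
                   and d["id"] != last_id),
                  key=lambda d: d["_ts"])

changes = fetch_changes("u1", 105, "d2")   # checkpoint was (_ts=105, id='d2')
print([d["id"] for d in changes])          # ['d3'] - shares d2's second, not missed
```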
If you go with the Change Feed, you can create a listener (Function or Job) to listen for all adds/updates from the collection and store those values in another collection; while saving the data you can add an identity/version field to every document. This approach may increase your Cosmos DB bill.
This is what the transaction consistency levels are for: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels
Choose strong consistency and your queries will always return the latest write.
Strong: Strong consistency offers a linearizability guarantee. The reads are guaranteed to return the most recent committed version of an item. A client never sees an uncommitted or partial write. Users are always guaranteed to read the latest committed write.

How can I use a gremlin query to filter based on a users permissions?

I am fairly new to graph databases, however I have used SQL Server and document databases (Lucene, DocumentDb, etc.) extensively. It's completely possible that I am approaching this query the wrong way, since I am new to graph databases. I am trying to convert some logic to a graph database (CosmosDB Graph via Gremlin, to be specific) that we currently use SQL Server for. The reason for the change is that this problem set is not really what SQL Server is great at, and so our SQL query (which we have optimized as well as we can) is really starting to be the hot spot of our application.
To give a very brief overview of our logic, we run a web shop that allows admins to configure products and users with several levels of granular permissions (described below). Based on these permissions, we show the user only the products they are allowed to see.
Entities:
Region: A region consists of multiple countries
Country: A country has many markets and many regions
Market: A market is a group of stores in a single country
Store: A store belongs to a single market
Users have the following set of permissions and each set can contain multiple values:
can-view-region
can-view-country
can-view-market
can-view-store
Products have the following set of permissions and each set can contain multiple values:
visible-to-region
visible-to-country
visible-to-market
visible-to-store
After trying for a few days, this is the query that I have come up with. This query does work and returns the correct products for the given user, however it takes about 25 seconds to execute.
g.V().has('user','username', 'john.doe').union(
__.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
__.out('can-view-country').in('in-market').hasLabel('store'),
__.out('can-view-market').in('in-market').hasLabel('store'),
__.out('can-view-store')
).dedup().union(
__.out('in-market').in('contains-country').in('visible-to-region').hasLabel('product'),
__.out('in-market').in('visible-to-country').hasLabel('product'),
__.out('in-market').in('visible-to-market').hasLabel('product'),
__.in('visible-to-store').hasLabel('product')
).dedup()
Is there a better way to do this? Is this problem maybe not best suited with a graph database?
Any help would be greatly appreciated!
Thanks,
Chris
I don't think this is going to help a lot, but here's an improved version of your query:
g.V().has('user','username', 'john.doe').union(
__.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
__.out('can-view-country','can-view-market').in('in-market').hasLabel('store'),
__.out('can-view-store')
).dedup().union(
__.out('in-market').union(
__.in('contains-country').in('visible-to-region'),
__.in('visible-to-country','visible-to-market')).hasLabel('product'),
__.in('visible-to-store').hasLabel('product')
).dedup()
I wonder if the hasLabel() checks are really necessary. If, for example, .in('in-market') can only lead to a store vertex, then remove the extra check.
Furthermore it might be worth creating shortcut edges. This would increase write times whenever you mutate the permissions, but should significantly reduce the read times for the given query. Since reads are likely to occur far more often than permission updates, this might be a good trade-off.
The CosmosDB Graph team is looking into improvements that can be done on the union step in particular.
Other options that haven't already been suggested:
Reduce the number of edges that are traversed per hop with additional predicates, e.g.:
g.V('1').outE('market').has('prop', 'value').inV()
Would it be possible to split the traversal up and do parallel requests in your client code? Since you are using .NET, you could take each result in the first union and execute parallel requests for the traversals in the second union. Something like this (untested code):
var concurrentColl = new ConcurrentBag<Vertex>();
string firstUnion = @"g.V().has('user','username', 'john.doe').union(
    __.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
    __.out('can-view-country').in('in-market').hasLabel('store'),
    __.out('can-view-market').in('in-market').hasLabel('store'),
    __.out('can-view-store')
).dedup()";
string[] secondUnionTraversals = new[] {
    "g.V('{0}').out('in-market').in('contains-country').in('visible-to-region').hasLabel('product')",
    "g.V('{0}').out('in-market').in('visible-to-country').hasLabel('product')",
    "g.V('{0}').out('in-market').in('visible-to-market').hasLabel('product')",
    "g.V('{0}').in('visible-to-store').hasLabel('product')",
};
var response = client.CreateGremlinQuery<Vertex>(col, firstUnion);
while (response.HasMoreResults)
{
    var results = await response.ExecuteNextAsync<Vertex>();
    foreach (Vertex v in results)
    {
        Parallel.ForEach(secondUnionTraversals, (traversal) =>
        {
            var secondResponse = client.CreateGremlinQuery<Vertex>(col, string.Format(traversal, v.Id));
            while (secondResponse.HasMoreResults)
            {
                // Parallel.ForEach lambdas cannot await, so block on the task here.
                foreach (Vertex product in secondResponse.ExecuteNextAsync<Vertex>().Result)
                {
                    concurrentColl.Add(product);
                }
            }
        });
    }
}

Freebase get singer from song

I want to develop an app that pulls the singers of any song that we query for. So if someone types in Carry On from the Some Nights album, the app is supposed to pull out who all sang that song. Thanks.
You can search for this using the Freebase Search API and Search Metaschema like this:
https://www.googleapis.com/freebase/v1/search?query=Carry+On&filter=(all+/music/release_track/release:"Some+Nights")&output=(/music/release_track/release+/music/release_track/recording./music/recording/artist)
There are three parts to this API request: the query, the filter and the output parameter. The query is simply the name of the track that you're looking for:
query=Carry+On
The filter parameter constrains the results to only tracks which are part of an album release named "Some Nights"
filter=(all+/music/release_track/release:"Some+Nights")
The output parameter tells the API which properties to return in the response. In this case we want to know which release the track is part of and which artist recorded the track.
output=(/music/release_track/release+/music/release_track/recording./music/recording/artist)
You'll notice that this query actually returns 8 matching tracks right now. This is because there were many different releases of the album which all contained recordings of that track (and not necessarily the exact same recording).
For what you're building it sounds like you should be able to just take the first result. You can constrain the search API to only return the first result by adding a limit parameter to the request:
limit=1
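Putting the three parameters together programmatically (a sketch only; the Freebase API has since been retired, so this merely illustrates how the request URL above is assembled):

```python
from urllib.parse import urlencode

# Assemble the Freebase Search API request from its three parts: query,
# filter, and output, plus the limit parameter. (The Freebase API has been
# retired; this only shows how the URL is constructed.)
base = "https://www.googleapis.com/freebase/v1/search"
params = {
    "query": "Carry On",
    "filter": '(all /music/release_track/release:"Some Nights")',
    "output": "(/music/release_track/release /music/release_track/recording./music/recording/artist)",
    "limit": 1,  # only the first matching track
}
url = base + "?" + urlencode(params)
print(url)
```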
