Some NoSQL CosmosDB advice required

I am looking for some advice in designing an application using NoSQL CosmosDB or relevant technology.
The data structure looks like the following currently:
{
"accounts": [{
"name": "name1",
"type": "type1"
},
{
"name": "name2",
"type": "type2"
}
],
"categories": [{
"master": "mastername",
"child": [
"child1name",
"child2name"
]
},
{
"master": "mastername2",
"child": [
"child3name",
"child4name"
]
}
],
"charts": {
},
"grouping": [{
"2018": [{
"06": {
"property1": "value1",
"property2":"value2"
},
"07": {
"property1": "value2",
"property2":"value2",
"property3":"value3"
}
}]
}],
"ItemsList": [{
"id": "2018051720",
"dateMonth": "201807",
"property1": "value2",
"date": "17/07/2018",
"Description": "description2"
},
{
"id": "2018051720",
"datemonth": "201807",
"property1": "value1",
"date": "17/07/2018",
"Description": "description"
}
],
"id": "7b786960c93cc9a8"
}
For budget reasons, I have currently decided to use a single collection that holds a list of documents with the structure shown above.
My question is whether this is a good design. I ask because the following elements can grow substantially over time:
ItemsList and grouping.
ItemsList will grow every month as users add to it. grouping gets one new entry per month of each year, and is updated as ItemsList items are added. categories and accounts could also change, but only irregularly.
If I keep this in one collection, I was thinking of using something like the following structure instead:
// Main Object
{
"accounts": [{
"name": "name1",
"type": "type1"
},
{
"name": "name2",
"type": "type2"
}
],
"categories": [{
"master": "mastername",
"child": [
"child1name",
"child2name"
]
},
{
"master": "mastername2",
"child": [
"child3name",
"child4name"
]
}
],
"charts": {
},
"id": "7b786960c93cc9a8"
}
// Groupings list
{
"grouping": [{
"userid": "7b786960c93cc9a8",
"grouping": {
"2018": [{
"06": {
"property1": "value1",
"property2": "value2"
},
"07": {
"property1": "value2",
"property2": "value2",
"property3": "value3"
}
}]
}
},
{
"userid": "sfkjehffkjwhf34343",
"grouping": {
"2018": [{
"04": {
"property1": "value1",
"property2": "value2"
},
"05": {
"property1": "value2",
"property2": "value2",
"property3": "value3"
},
"06": {
"property1": "value2",
"property2": "value2",
"property3": "value3"
}
}]
}
}
]
}
// Item List List
{
"ItemLists": [{
"userid": "7b786960c93cc9a8",
"itemlist": [{
"id": "2018051720",
"dateMonth": "201807",
"property1": "value2",
"date": "17/07/2018",
"Description": "description2"
},
{
"id": "2018051720",
"datemonth": "201807",
"property1": "value1",
"date": "17/07/2018",
"Description": "description"
}
]
},
{
"userid": "sfkjehffkjwhf34343",
"itemlist": [{
"id": "2018051720",
"dateMonth": "201807",
"property1": "value2",
"date": "17/07/2018",
"Description": "description2"
},
{
"id": "2018051720",
"datemonth": "201807",
"property1": "value1",
"date": "17/07/2018",
"Description": "description"
}
]
}
]
}
As you can see, the main object list would grow as normal, while the separate JSON objects for itemlist and grouping could grow independently of the main object. That would require two or even three reads for the website, though. Working on only having 400 RU's a month basically; there is not much of a user base or many objects.
What is the best way to do this while keeping cost in mind? If money were no problem, I would probably have gone with a collection for each, where the main object just references the other collections by id or something similar.
Hope it makes a bit of sense, in my head it does :)

Imho you're making the age-old mistake of worrying about optimization before a problem arises. Also, your sentence "Working on only having 400 RU's a month" somehow makes me feel like you should read up more on the topic of RUs.
Check here for Information about RU's and tools to estimate your throughput
The 400 RU/s that cap your collection's throughput might slow down your end users' experience (though there may be other bottlenecks, usually their on-premises internet connection).
You can always watch the usage of your collections in the Azure Portal and scale up within minutes, so you cannot go wrong by starting with 400 RU/s.
Every request not made is the biggest possible boost to performance.
Requests in CosmosDB are already bloated with headers for security. You will not get a notable performance boost from shaving a few bytes off your objects here and there, but local caching (be it on your web server or on the user's machine) will help, and it's very easy to do if you simply store the whole JSON objects as key-value pairs (basically what CosmosDB does).
What would be wrong in my opinion is considering multiple collections. I think you have misunderstood the concept there a little. One collection per customer/project is usually the way to go, so don't worry. Everything is indexed and uniquely ID'ed internally so separating things is no problem. One collection per "object type" makes the advantage of any NoSQL Database moot.
If you worry about your "internal lists" getting too long, just save them separately and only save their id in the original object. Then you load them on-demand in your application. Generally speaking more small objects are better than few large objects - if you are able to load them cleverly in your application.
So instead of this:
{
"userid": "sfkjehffkjwhf34343",
"grouping": {
"2018": [{
"04": {
"property1": "value1",
"property2": "value2"
},
"05": {
"property1": "value2",
"property2": "value2",
"property3": "value3"
},
"06": {
"property1": "value2",
"property2": "value2",
"property3": "value3"
}
}]
}
}
you could do this instead
{
"userid": "sfkjehffkjwhf34343",
"grouping": {
"2018": ["x1","x2","x3"]
}
}
{
"groupingid": "x1",
"month":"04",
"values": {
"property1": "value1",
"property2": "value2"
}
}
{
"groupingid": "x2",
"month":"05",
"values": {
"property1": "value1",
"property3": "value3",
"property2": "value2"
}
}
{
"groupingid": "x3",
"month":"06",
"values": {
"property1": "value1",
"property2": "value2"
}
}
Load them only if needed, cache them according to their internal id (which changes on every update if you leave it out) and you won't believe how performant this can be.
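The split and the on-demand loading described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the id scheme (userid-year-month), the `split_grouping` function and the `CachedLoader` class are hypothetical names, and `fetch` stands in for whatever single-document read your CosmosDB client provides. Cache invalidation after updates is deliberately left out.

```python
def split_grouping(user_doc):
    """Split one user's embedded grouping into a slim parent plus one document per month.

    Assumes the shape shown above: grouping maps a year to a list of
    {month: values} objects. Returns [parent, child, child, ...].
    """
    parent = {"userid": user_doc["userid"], "grouping": {}}
    children = []
    for year, entries in user_doc["grouping"].items():
        ids = []
        for entry in entries:
            for month, values in entry.items():
                # Hypothetical id scheme; any unique id works.
                gid = f'{user_doc["userid"]}-{year}-{month}'
                children.append({"groupingid": gid, "month": month, "values": values})
                ids.append(gid)
        parent["grouping"][year] = ids
    return [parent] + children


class CachedLoader:
    """Load documents by id through `fetch`, calling it at most once per id."""

    def __init__(self, fetch):
        self._fetch = fetch      # stand-in for a real single-document read
        self._cache = {}
        self.misses = 0          # number of times `fetch` was actually called

    def get(self, doc_id):
        if doc_id not in self._cache:
            self.misses += 1
            self._cache[doc_id] = self._fetch(doc_id)
        return self._cache[doc_id]
```

With this, the parent document stays small no matter how many months accumulate, and a month is only read (and only once) when the application actually asks for it.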
You should also read up on stored procedures; they are really powerful and in some cases a true gold mine for performance improvements.
There is a lot of good information from Microsoft out there, though admittedly it's not always easy to find.
CosmosDB is frankly an incredibly powerful tool if used correctly, but I encourage you to read up on it a little more so you can use it effectively, both performance- and cost-wise.

Search for available items for booking

I am working on an online booking system for items.
I am using MongoDB to store booking and item details.
Item
{
"id": "3",
"name": "",
"description": "",
"extra": [{}]
}
Booking
{
"id": "",
"itemId": "",
"startDate": millis,
"endDate": millis,
"status": "",
"userId": ""
}
I have to implement a search between dates. The search should return only the items that are available for the specified period. How can I build a scalable search for this? I am also planning to use Elasticsearch for search. Suggestions for other technologies are welcome too.
I'd suggest making the booking the base object and putting the item info inside it. That is to say:
Set up mapping:
PUT bookings
{
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"item": {
"properties": {
"id": {
"type": "keyword"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"description": {
"type": "text"
},
"extra": {
"type": "nested"
}
}
},
"startDate": {
"type": "date",
"format": "epoch_millis"
},
"endDate": {
"type": "date",
"format": "epoch_millis"
},
"status": {
"type": "keyword"
},
"userId": {
"type": "keyword"
}
}
}
}
Ingest the simplest booking
POST bookings/_doc
{
"item": {
"id": "987"
},
"startDate": 1587110540025,
"endDate": 1587220730025
}
Restricting the *Date fields and only returning the corresponding item:
GET bookings/_search
{
"_source": "item",
"query": {
"bool": {
"must": [
{
"range": {
"startDate": {
"gte": "17/04/2020",
"format": "dd/MM/yyyy"
}
}
},
{
"range": {
"endDate": {
"lte": "18/04/2020",
"format": "dd/MM/yyyy"
}
}
}
]
}
}
}
Note that although our date fields are defined as epoch_millis, we can still query using human-readable date strings, provided we specify the format. You can of course use milliseconds if you prefer.
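The query above returns bookings that fall inside a window. To get from there to "available items", the usual test is whether any booking of an item overlaps the requested period. Here is a minimal Python sketch of that test, assuming the flat booking shape from the question (`itemId`, `startDate`, `endDate` in epoch milliseconds), half-open intervals, and that every booking blocks the item regardless of its status. The function names are made up for illustration.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if the half-open intervals [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


def available_items(item_ids, bookings, start, end):
    """Return the ids from item_ids that have no booking overlapping [start, end)."""
    booked = {
        b["itemId"]
        for b in bookings
        if overlaps(b["startDate"], b["endDate"], start, end)
    }
    # Preserve the caller's ordering of item ids.
    return [item_id for item_id in item_ids if item_id not in booked]
```

The same two comparisons (`startDate < end` and `endDate > start`) are what a pair of range clauses would need to express if you want the search engine to find the conflicting bookings for you.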
While indexing items into Elasticsearch, you can check their bookings. Suppose you are indexing items and fetch each item from MongoDB: you can also fetch the bookings for that item and add a field such as bookingCount to the item document in Elasticsearch. When searching, you can then use the bookingCount field to find items without bookings.
In general, indexing is an asynchronous operation, so you can use a queue. This reduces latency for user operations, and you can do whatever processing you need there, for example building a summary of the bookings and putting it inside the item.
{
"id": "3",
"name": "",
"description": "",
"extra": [{}],
"bookingCount": "",
"bookingsByStatus": {
"status_1": 1233,
"status_2": 1233,
...
}
}
But this is a business decision. Also, after any update to items or bookings, you need to update the item in the Elasticsearch index. Alternatively, you can use another solution such as the one mentioned by @jzzfs.

How to insert empty string in DynamoDB using the output of a Lambda in Step Functions?

I'm trying to save the output of a Lambda which calls Lex to DynamoDB using Step Functions.
The intentName in a Lex response is sometimes null (unknown). The problem is that the state (task) that saves the response to DynamoDB then receives an empty string, and I get an error from DynamoDB.
Is there any workaround, maybe using JsonPath or the state machine definition of the Step Function, to insert null or simply not insert that specific property into DynamoDB?
Here is the JSON for the state machine:
{
"StartAt": "ProcessLex",
"States": {
"ProcessLex": {
"Type": "Task",
"Resource": "arn:aws:lambda:<Region>:<Account Id>:function:getIntent",
"ResultPath": "$.lexResult",
"Next": "ChoiceIfIntent"
},
"SaveToDynamo": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:putItem",
"Parameters": {
"TableName": "MyTable",
"Item": {
"dateTime": {
"S.$": "$.dateTime"
},
"intentName": {
"S.$": "$.lexResult.intentName"
},
"analysis": {
"M.$": "$.lexResult.sentimentResponse"
}
}
},
"End": true
},
"Comprehend": {
"Comment": "To be implemented later",
"Type": "Pass",
"End": true
},
"ChoiceIfIntent": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.lexResult.intentName",
"StringGreaterThanEquals": "",
"Next": "SaveToDynamo"
}
],
"Default": "Comprehend"
}
}
}
The problem is not the null value; the problem is that with the PutItem API, DynamoDB does not let you insert empty strings.
I know this is frustrating, but the quickest solution is to replace "" with NULL.
The solution that I prefer is to set convertEmptyValues to true in your DynamoDB DocumentClient settings:
const dynamodb = new AWS.DynamoDB.DocumentClient({ convertEmptyValues: true })
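The 'replace "" with NULL' step can also be done in application code before the write. A minimal Python sketch, assuming the item is a plain structure of dicts and lists; `empty_to_none` is a hypothetical helper and not part of any AWS SDK, and how `None` is serialized afterwards depends on the client you use.

```python
def empty_to_none(value):
    """Return a copy of value with every empty string replaced by None.

    Walks nested dicts and lists; all other values are returned unchanged.
    """
    if isinstance(value, dict):
        return {key: empty_to_none(item) for key, item in value.items()}
    if isinstance(value, list):
        return [empty_to_none(item) for item in value]
    return None if value == "" else value
```

For example, `empty_to_none({"intentName": ""})` yields `{"intentName": None}`, while numbers such as `0` and non-empty strings are left alone.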
UPDATE
Since yesterday, DynamoDB supports empty values for non-key String attributes!
Take a look here.

How to associate nested relationships with attributes for a POST in JSON API

According to the spec, resource identifier objects do not hold attributes.
I want to do a POST to create a new resource that includes other nested resources.
These are the basic resources: club (with name) and many positions (type). Think of a football club with positions like goalkeeper, goalkeeper, striker, striker, etc.
When I make this association, I want to set some attributes, such as whether the position is required for this particular team. For example, I only need 1 goalkeeper, but I want to have a team with many reserve goalkeepers. When I model these entities in the DB, I'll set the required attribute in a linkage table.
This is not compliant with JSON API:
{
"data": {
"type": "club",
"attributes": {
"name": "Backyard Football Club"
},
"relationships": {
"positions": {
"data": [{
"id": "1",
"type": "position",
"attributes": {
"required": "true"
}
}, {
"id": "1",
"type": "position",
"attributes": {
"required": "false"
}
}
]
}
}
}
}
This is also not valid:
{
"data": {
"type": "club",
"attributes": {
"name": "Backyard Football Club",
"positions": [{
"position_id": "1",
"required": "true"
},
{
"position_id": "1",
"required": "false"
}]
}
}
}
So what is the best way to approach this association?
The best approach here is to create a separate resource for club_position.
Creating a club will return a URL for creating club_positions. You then POST each club_position to that URL, with relationship identifiers pointing to the position and club resources.
An added benefit is that club_position creation can be parallelized.
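As an illustration of what such a request body might look like, here is a small Python sketch that builds one. The resource type names (`club_position`, `club`, `position`) and the function name are assumptions made for this example; the point is only that `required` lives in the attributes of the new resource, while the relationships carry plain resource identifiers (type and id, no attributes).

```python
import json


def club_position_payload(club_id, position_id, required):
    """Build a JSON:API-style document for creating one club_position resource."""
    return {
        "data": {
            "type": "club_position",
            # The per-association data goes here, on the linking resource itself.
            "attributes": {"required": required},
            "relationships": {
                # Resource identifier objects: only type and id.
                "club": {"data": {"type": "club", "id": club_id}},
                "position": {"data": {"type": "position", "id": position_id}},
            },
        }
    }


# One body per association; each can be sent in its own POST request.
bodies = [
    json.dumps(club_position_payload("7", "1", True)),
    json.dumps(club_position_payload("7", "1", False)),
]
```

Because every association is its own document, the two bodies above are independent of each other, which is what makes parallel creation possible.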

How to form inner SubQuery in Gremlin Server (Titan 1.0)?

I'm using the following query:
g.V(741440).outE('Notification').order().by('PostedDateLong', decr).range(0,1).as('notificationInfo').match(
__.as('notificationInfo').inV().as('postInfo'),
).select('notificationInfo','postInfo')
It gives the following result:
{
"requestId": "9846447c-4217-4103-ac2e-de3536a3c62a",
"status": {
"message": "",
"code": 200,
"attributes": { }
},
"result": {
"data": [
{
"notificationInfo": {
"id": "c0zs-fw3k-347p-g2g0",
"label": "Notification",
"type": "edge",
"inVLabel": "Comment",
"outVLabel": "User",
"inV": 749664,
"outV": 741440,
"properties": {
"ParentPostId": "823488",
"PostedDate": "2016-05-26T02:35:52.3889982Z",
"PostedDateLong": 635998269523889982,
"Type": "CommentedOnPostNotification",
"NotificationInitiatedByVertexId": "1540312"
}
},
"postInfo": {
"id": 749664,
"label": "Comment",
"type": "vertex",
"properties": {
"PostImage": [
{
"id": "amto-g2g0-2wat",
"value": ""
}
],
"PostedByUser": [
{
"id": "am18-g2g0-2txh",
"value": "orbitpage#gmail.com"
}
],
"PostedTime": [
{
"id": "amfg-g2g0-2upx",
"value": "2016-05-26T02:35:39.1489483Z"
}
],
"PostMessage": [
{
"id": "aln0-g2g0-2t51",
"value": "hi"
}
]
}
}
}
],
"meta": { }
}
}
I also want the response to include the vertex whose id is stored in the edge property NotificationInitiatedByVertexId.
For that I tried the following query:
g.V(741440).outE('Notification').order().by('PostedDateLong', decr).range(0,2).as('notificationInfo').match(
__.as('notificationInfo').inV().as('postInfo'),
g.V(1540312).next().as('notificationByUser')
).select('notificationInfo','postInfo','notificationByUser')
Note: I used the vertex id directly in the subquery because I didn't know how to read the value from the edge property dynamically within the query itself.
It gives an error. I have tried a lot but have not been able to find a solution.
I'm assuming that you are storing a Titan generated identifier in that edge property called NotificationInitiatedByVertexId. If so, please consider the following even though this first part doesn't really answer your question. I don't think you should store a vertex identifier on the edge. Your graph model should explicitly track the relationship of NotificationInitiatedBy with an edge and by storing the identifier of the vertex on the edge itself you are bypassing that. Also, if you ever have to migrate your data in some way, the ids won't be preserved (Titan will generate new ones) and trying to sort that out will be a mess.
Even if that is not a Titan-generated identifier but a logical one you created, I still think I would look to adjust your graph schema and promote that Notification to a vertex. Then your Gremlin traversals would flow more easily.
Now, assuming you don't change that, then I don't see a reason to not just issue two queries in the same request and then combine the results to one data structure. You just need to do a lookup with the vertex id which is going to be pretty fast and inexpensive:
edgeStuff = g.V(741440).outE('Notification').
order().by('PostedDateLong', decr).range(0,1).as('notificationInfo').
... // whatever logic you have
select('notificationInfo','postInfo').next()
vertexStuff = g.V(edgeStuff.get('notificationInfo').value('NotificationInitiatedByVertexId')).next()
[notificationInitiatedBy: vertexStuff, notification: edgeStuff]

JSONAPI - Difference between self and related in a links resource

Why are the self and related references different in the JSONAPI resource below? Aren't they pointing to the same resource? What is the difference between going to /articles/1/relationships/tags and /articles/1/tags?
{
"links": {
"self": "/articles/1/relationships/tags",
"related": "/articles/1/tags"
},
"data": [
{ "type": "tags", "id": "2" },
{ "type": "tags", "id": "3" }
]
}
You can read about that here: https://github.com/json-api/json-api/issues/508.
Basically, with /articles/1/relationships/tags the response will be an object that represents the relationship between articles and tags. The response could be something like this (what you put in your question):
{
"links": {
"self": "/articles/1/relationships/tags",
"related": "/articles/1/tags"
},
"data": [
{ "type": "tags", "id": "2" },
{ "type": "tags", "id": "3" }
]
}
This response gives only the data necessary to manipulate the relationship (in the primary data member, data) and not the resources connected through the relationship. That being said, you'll call /articles/1/relationships/tags if you want to create a new relationship, add a new tag to an article (basically updating the relationship), read which tags belong to an article (you only need their identity to look them up on the server) or delete an article's tags.
On the other hand, calling /articles/1/tags will respond with the tags as primary data, along with all the other properties that they have (attributes, relationships, links) and other top-level members such as included, links and/or jsonapi.
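To make the contrast concrete, here is a rough Python sketch of how a server might build the two responses from the same stored tags. The functions and the `name` attribute are invented for illustration and do not come from any JSON:API library; only the URL layout is taken from the question.

```python
def relationship_document(article_id, tags):
    """Response for /articles/{id}/relationships/tags: identifiers only."""
    return {
        "links": {
            "self": f"/articles/{article_id}/relationships/tags",
            "related": f"/articles/{article_id}/tags",
        },
        # Strip everything except type and id from each tag.
        "data": [{"type": tag["type"], "id": tag["id"]} for tag in tags],
    }


def related_document(article_id, tags):
    """Response for /articles/{id}/tags: the full tag resources as primary data."""
    return {
        "links": {"self": f"/articles/{article_id}/tags"},
        "data": tags,
    }


# Hypothetical stored tags; the "name" attribute is an assumption.
stored_tags = [
    {"type": "tags", "id": "2", "attributes": {"name": "json"}},
    {"type": "tags", "id": "3", "attributes": {"name": "api"}},
]
```

Calling `relationship_document("1", stored_tags)` reproduces the document from the question, while `related_document("1", stored_tags)` returns the same two tags with their attributes included.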
They are different. Here is an example from my project.
Try GET http://localhost:3000/phone-numbers/1/relationships/contact and you will get a response like this:
{
"links": {
"self": "http://localhost:3000/phone-numbers/1/relationships/contact",
"related": "http://localhost:3000/phone-numbers/1/contact"
},
"data": {
"type": "contacts",
"id": "1"
}
}
You didn't get the attributes and relationships, which are probably what you want to retrieve.
Then try GET http://localhost:3000/phone-numbers/1/contact and you will get a response like this:
{
"data": {
"id": "1",
"type": "contacts",
"links": {
"self": "http://localhost:3000/contacts/1"
},
"attributes": {
"name-first": "John",
"name-last": "Doe",
"email": "john.doe@boring.test",
"twitter": null
},
"relationships": {
"phone-numbers": {
"links": {
"self": "http://localhost:3000/contacts/1/relationships/phone-numbers",
"related": "http://localhost:3000/contacts/1/phone-numbers"
}
}
}
}
}
You can see that you retrieved all the information you want, including the attributes and relationships.
But you should know that the relationships endpoint has its own uses. Please read http://jsonapi.org/format/#crud-updating-to-one-relationships as an example.
