How to update values in Cosmos DB as output using Azure Stream Analytics? - azure-cosmosdb

At the first event I get data like below:
{
'product_name':'hamam',
'quantity':'100'
}
At the second event I get data like below:
{
'product_name':'hamam',
'quantity':'70'
}
Here I want to update the values in Cosmos DB. How can I do it?

ASA supports an upsert feature for Cosmos DB if your data contains a unique document ID (your sample data does not seem to have one). Please see this paragraph about upserts in ASA for Cosmos DB.
Some excerpt as below:
Stream Analytics integration with Azure Cosmos DB allows you to insert or update records in your container based on a given Document ID column.
If the incoming JSON document has an existing ID field, that field is automatically used as the Document ID column in Cosmos DB and any subsequent writes are handled as such, leading to one of these situations:
unique IDs lead to insert
duplicate IDs and 'Document ID' set to 'ID' leads to upsert
duplicate IDs and 'Document ID' not set leads to error after the first document
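As an illustration of that last point (this reshaping is an assumption, not something from the question): if product_name is what identifies a product, projecting it as an id column in the ASA query makes both events target the same document, so the second event overwrites the first. The document arriving at Cosmos DB would then look like:
{
'id':'hamam',
'product_name':'hamam',
'quantity':'70'
}
With duplicate ids and 'Document ID' set to 'id', the write is handled as an upsert instead of creating a new document.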

Related

Upsert Cosmos item TTL using Azure Data Factory Copy Activity

I have a requirement to upsert data from a REST API to Cosmos DB and also maintain the item-level TTL for a particular time interval.
I have used the ADF Copy activity to copy the data, but for the TTL I used an additional custom column on the source side with a hardcoded value of 30.
I noticed that the time interval (seconds) is written as a string instead of an integer, so it fails with the below error.
Details
Failure happened on 'Sink' side. ErrorCode=UserErrorDocumentDBWriteError,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Documents failed to import due to invalid documents which violate some of Cosmos DB constraints: 1) Document size shouldn't exceeds 2MB; 2) Document's 'id' property must be string if any, and must not include the following characters: '/', '\', '?', '#'; 3) Document's 'ttl' property must not be non-digital type if any.,Source=Microsoft.DataTransfer.DocumentDbManagement,'
[Screenshot: ttl mapping from the custom column to Cosmos DB]
When I use ttl1 instead of ttl, it succeeds and the value is stored as a string.
Any suggestions, please?
Yes, that's the issue with additional columns in the Copy activity. Even if you set the column to int, it is converted to a string at the source.
A possible workaround is to create a Cosmos DB trigger in an Azure Function and add the 'ttl' there.
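For reference, the constraint quoted in the error means the document that reaches Cosmos DB must carry ttl as a number, not a string; illustratively (only id and ttl matter here, the rest depends on your payload):
{
"id": "<document id>",
"ttl": 30
}
Writing "ttl": "30" (a string), which is what the additional column produces, is exactly what triggers the failure.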

Handling DocumentClientException with BulkImport

I am using Microsoft.Azure.CosmosDB.BulkExecutor.IBulkExecutor.BulkImportAsync to insert documents as a batch. I have implemented unique constraints for my Cosmos DB collection. If any of the input documents violates a constraint, the entire bulk import operation fails by throwing a DocumentClientException. Is this expected behaviour? Or is there a way to handle the exceptions for the failed documents and make sure the valid documents are inserted?
First of all, thanks to the Microsoft documentation, which explains solid scenarios around this issue:
https://learn.microsoft.com/en-us/azure/data-factory/connector-troubleshoot-guide
This error appears when we define a unique key in addition to the default id field defined by Cosmos DB. The reason could be duplicate rows for the unique key within the dataset. Another possible reason is that the delta dataset we are about to load contains unique keys that are already present in the existing Cosmos DB data.
For regular batch jobs there could be updates to records with an existing unique key, but we cannot update an existing unique key through the batch process, because each record lands in Cosmos DB as a new record with a new 'id' value. Cosmos DB updates an existing record only on a matching id field, not on the unique key.
Workaround: since the unique key is already unique for every row across the entire collection, we can use the unique value itself as the 'id' field. Now any updates to additional fields apart from the unique key are applied as updates, because the 'id' for the respective unique key stays the same.
In SQL terms:
SELECT <unique_key_field> AS id, <unique_key_field>, field1, field2 FROM <table_name>

UPSERT / INSERT / UPDATE between Databricks and Cosmos

Currently we are using Azure Databricks as the transformation layer, and the transformed data is loaded into Cosmos DB through the connector.
Scenario:
We have 2 source files.
The 1st file contains name, age.
The 2nd file contains name, state, country.
In Cosmos, I have created a collection with an id and a partition key.
In Databricks, I am loading these 2 files as DataFrames and creating a temp table to query the content.
I am querying the content of the first file [ select name as id, name, age from file ] and loading it into the Cosmos collection.
From the second file I am using [ select name as id, state, country ] and loading into the same collection, expecting the content of the second file to be added to the same document based on the id field.
The issue is that when I load the content of the second file, the attribute 'age' from the first file gets deleted and only id, name, state, country is seen in the Cosmos document. This is happening because I am using UPSERT in Databricks to load to Cosmos.
When I change UPSERT to INSERT or UPDATE it throws an error which says 'Resource with id already exists'.
Databricks Connection to Cosmos:
// Cosmos DB connection options for the azure-cosmosdb-spark connector
val configMap = Map(
  "Endpoint"         -> "https://",   // account URI (redacted)
  "Masterkey"        -> "",           // account key (redacted)
  "Database"         -> "ods",
  "Collection"       -> "tval",
  "preferredRegions" -> "West US",
  "upsert"           -> "true")       // upsert replaces the whole document that has a matching id
val config = com.microsoft.azure.cosmosdb.spark.config.Config(configMap)
Is there a way to insert the attributes from the second file without deleting the attributes that are already present? I am not using a JOIN operation as the use case doesn't fit.
From a vague memory of doing this, you need to set the id attribute on your data frame so that it matches between the two datasets.
If you omit this field, Cosmos generates a new record - which is what is happening for you.
So if df1 & df2 have id=1 on the first record then the first will insert it, the second will update it.
But if they are the same record then joining in Spark will be far more efficient, as in the sketch below.
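A minimal sketch of that join, assuming the two files are already loaded as DataFrames df1 (name, age) and df2 (name, state, country) and reusing the config value from the question; the exact write call may vary with the connector version:
import org.apache.spark.sql.functions.col
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark

// Merge both sources into one row per name, then write one complete document per id.
val merged = df1
  .join(df2, Seq("name"), "full_outer")   // keep rows that appear in either file
  .select(col("name").as("id"),           // duplicate name as the Cosmos DB id
          col("name"), col("age"), col("state"), col("country"))

// A single upsert per id, so the second partial write no longer wipes out 'age'.
CosmosDBSpark.save(merged, config)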

DynamoDB sub-item filter using .NET Core API

First of all, I have a table structure like this:
Users: {
  UserId,
  Name,
  Email,
  SubTable1: [
    { Column-111, Column-112 },
    { Column-121, Column-122 }
  ],
  SubTable2: [
    { Column-211, Column-212 },
    { Column-221, Column-222 }
  ]
}
As I am new to DynamoDB, I have a couple of questions about this:
1. Can I create a structure like this?
2. Can we set a primary key for the subtables?
3. Luckily, I found a DynamoDB helper class to do some operations on my DB:
https://www.gopiportal.in/2018/12/aws-dynamodb-helper-class-c-and-net-core.html
But I don't know how to fetch only a particular subtable.
4. Can we fetch only specific columns from my main table? I also need a suggestion for the subtables.
Note: I am using .NET Core (C#) to communicate with DynamoDB.
Can I create a structure like this?
Yes.
Can we set a primary key for the subtables?
No, the hash key can be set on top-level scalar attributes only (String, Number, etc.).
Luckily, I found a DynamoDB helper class to do some operations on my DB.
https://www.gopiportal.in/2018/12/aws-dynamodb-helper-class-c-and-net-core.html
But I don't know how to fetch only a particular subtable.
When you say subtables, I assume you are referring to the Array datatype in the sample table above. To fetch data from a DynamoDB table, you need the hash key in order to use the Query API. If you don't have the hash key, you can use the Scan API, which scans the entire table and is a costly operation.
A GSI (Global Secondary Index) can be created to avoid the scan operation. However, it can be created on scalar attributes only; a GSI can't be created on an Array attribute.
Another option is to redesign the table to match your query access patterns.
Can we fetch only specific columns from my main table? I also need a suggestion for the subtables.
Yes, you can fetch specific columns using ProjectionExpression. This way you get only the required attributes in the result set, as in the sketch below.
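A rough sketch of the Query/ProjectionExpression idea, shown with the AWS SDK for Java from Scala (the .NET SDK exposes an equivalent ProjectionExpression option); the table, key and attribute names are taken from the sample above and the key value is made up:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, GetItemRequest}

val client = AmazonDynamoDBClientBuilder.defaultClient()

// Fetch one user by its (assumed) hash key, but project only Email and the SubTable1 attribute.
val request = new GetItemRequest()
  .withTableName("Users")
  .withKey(java.util.Collections.singletonMap("UserId", new AttributeValue().withS("user-123")))
  .withProjectionExpression("Email, SubTable1")

val item = client.getItem(request).getItem   // java.util.Map[String, AttributeValue]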

Deleting items in a DynamoDB table with duplicate values

I have a DynamoDB table with the following structure:
{
accountId: string,//PRIMARY KEY
userId: string,//SORT KEY
email: string,
dateCreated: number // timestamp
}
I want to perform an action that deletes all items with duplicate emails from the table except for the one with the oldest dateCreated attribute.
Is this operation possible in DynamoDB?
Thanks
Firstly, you need both the partition and sort keys to delete an item from DynamoDB. Unless you know the accountId and userId, you can't perform the delete item operation.
In the above use case, neither the email nor the dateCreated attribute is part of the key.
Also, sort functionality is available for the sort key attribute only.
Approach 1:-
Preferred if this is a one-time activity.
Get the data and identify the older values based on dateCreated on the client side.
Delete the items in DynamoDB based on accountId and userId.
Approach 2:-
Preferred if this is required frequently.
Create a GSI with email as the hash key and dateCreated as the sort key.
Assuming you know the email id you want to query against to check whether it has duplicates, you can use the Query API with the index name, the email id value, and ScanIndexForward set to false (i.e. descending order).
The result set will then have the latest record for that email at the top and the oldest at the bottom. Keep the last (oldest) record and run the Delete API with accountId and userId for the rest of the items, as in the sketch below.
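A rough sketch of that flow with the AWS SDK for Java from Scala (the table name, GSI name and email value are assumptions; result pagination is omitted):
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, DeleteItemRequest, QueryRequest}
import scala.collection.JavaConverters._

val client = AmazonDynamoDBClientBuilder.defaultClient()

// Query the assumed GSI (hash key: email, sort key: dateCreated) in descending order,
// so the oldest record for this email comes last.
val query = new QueryRequest()
  .withTableName("Accounts")
  .withIndexName("email-dateCreated-index")
  .withKeyConditionExpression("email = :e")
  .withExpressionAttributeValues(
    java.util.Collections.singletonMap(":e", new AttributeValue().withS("someone@example.com")))
  .withScanIndexForward(false)   // descending by dateCreated

val items = client.query(query).getItems.asScala

// Keep the last (oldest) item; delete every newer duplicate using the table's full primary key.
items.dropRight(1).foreach { item =>
  val key = new java.util.HashMap[String, AttributeValue]()
  key.put("accountId", item.get("accountId"))
  key.put("userId", item.get("userId"))
  client.deleteItem(new DeleteItemRequest().withTableName("Accounts").withKey(key))
}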
Approach 3:-
Preferred if the data is manageable as a flat file and you can run a program to find the duplicates.
Export the data to an S3 bucket using AWS Data Pipeline.
Run a program that reads the file, finds the duplicates, and executes DynamoDB delete requests for those items.
Approach 4:-
Preferred if the data is large.
Export the data to AWS EMR using AWS Data Pipeline.
Run queries to find the duplicates and execute DynamoDB delete requests for those items.
Note:-
Please note that if you are expecting something like SQL with sub-queries to identify the latest updated record and delete the rest, it is NOT possible on DynamoDB
