How to improve the performance when copying data from cosmosdb? - azure-cosmosdb

I am currently trying to copy data from Cosmos DB to Data Lake Store with Data Factory.
However, the performance is poor: about 100 KB/s, while the data volume is 100+ GB and keeps increasing. At that rate it will take 10+ days to finish, which is not acceptable.
The Microsoft document https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance mentions that the max speed from Cosmos DB to Data Lake Store is 1 MB/s. Even that would still be too slow for us.
The Cosmos DB migration tool doesn't work: no data is exported and there is no issue log.
Data Lake Analytics U-SQL can extract from external sources, but currently only Azure SQL DB/DW and SQL Server are supported, not Cosmos DB.
How, or with what tools, can I improve the copy performance?

According to your description, I suggest you try setting a higher cloudDataMovementUnits value to improve the performance.
A cloud data movement unit (DMU) is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. A DMU might be used in a cloud-to-cloud copy operation, but not in a hybrid copy.
By default, Data Factory uses a single cloud DMU to perform a single Copy Activity run. To override this default, specify a value for the cloudDataMovementUnits property as follows. For information about the level of performance gain you might get when you configure more units for a specific copy source and sink, see the performance reference.
Notice: a setting of 8 and above currently works only when you copy multiple files from Blob storage/Data Lake Store/Amazon S3/cloud FTP/cloud SFTP to Blob storage/Data Lake Store/Azure SQL Database.
So the max DMU you can set here is 4.
Besides, if this speed still doesn't meet your requirement, I suggest you write your own logic to copy from DocumentDB to Data Lake.
You could create multiple webjobs that copy from DocumentDB to Data Lake in parallel.
You could split the documents according to index range or partition, and have each webjob copy a different part. In my opinion, this will be faster.
About the DMU, can I use it directly or do I need to apply for it first? Are the webjobs you mention a DotNet activity? Can you give some more details?
As far as I know, you can use the DMU directly; just add the cloudDataMovementUnits value in the JSON file as below:
"activities":[
{
"name": "Sample copy activity",
"description": "",
"type": "Copy",
"inputs": [{ "name": "InputDataset" }],
"outputs": [{ "name": "OutputDataset" }],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"cloudDataMovementUnits": 32
}
}
]
A WebJob can run programs or scripts in your Azure App Service web app in three ways: on demand, continuously, or on a schedule.
That means you could write a C# program (or use another language) to copy the data from DocumentDB to Data Lake (all of the copy logic would be written by yourself).
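As a very rough sketch (not a definitive implementation) of that webjob logic, assuming the DocumentDB .NET SDK: the endpoint, key, database/collection names and the _ts-based slicing below are placeholders, and the Data Lake Store upload is left as a hypothetical helper (WriteBatchToDataLakeAsync) that you would implement with the ADLS SDK or REST API.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

class ParallelCopyJob
{
    private static readonly DocumentClient Client =
        new DocumentClient(new Uri("https://<account>.documents.azure.com:443/"), "<key>");

    private static readonly Uri CollectionUri =
        UriFactory.CreateDocumentCollectionUri("<database>", "<collection>");

    static async Task Main()
    {
        // One query per logical slice of the collection; the _ts boundaries here are
        // made up for illustration; slice by whatever range or partition key fits your data.
        var sliceQueries = new[]
        {
            "SELECT * FROM c WHERE c._ts <  1500000000",
            "SELECT * FROM c WHERE c._ts >= 1500000000"
        };

        // Run one copy task per slice, which is what several webjobs would do in parallel.
        await Task.WhenAll(sliceQueries.Select((query, i) => CopySliceAsync(query, i)));
    }

    private static async Task CopySliceAsync(string query, int sliceId)
    {
        var feedOptions = new FeedOptions { MaxItemCount = 1000, EnableCrossPartitionQuery = true };
        var documentQuery = Client
            .CreateDocumentQuery<dynamic>(CollectionUri, query, feedOptions)
            .AsDocumentQuery();

        // Page through the slice and push each page to Data Lake Store.
        while (documentQuery.HasMoreResults)
        {
            var page = await documentQuery.ExecuteNextAsync<dynamic>();
            await WriteBatchToDataLakeAsync($"slice-{sliceId}.json", page);
        }
    }

    // Hypothetical helper: append a page of documents to a file in Data Lake Store
    // (implement with the ADLS SDK or REST API).
    private static Task WriteBatchToDataLakeAsync(string fileName, IEnumerable<dynamic> documents)
        => Task.CompletedTask;
}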

Related

Azure Synapse replicated to Cosmos DB?

We have an Azure data warehouse, db2 (Azure Synapse), that will need to be consumed by read-only users around the world, and we would like to replicate the needed objects from the data warehouse, potentially to a Cosmos DB. Is this possible, and if so, what are the available options (transactional, merge, etc.)?
Synapse is mainly about getting your data in to do analysis. I don't think it has a direct export option of the kind you have described above.
However, what you can do is use 'Azure Stream Analytics', and then you should be able to integrate/stream whatever you want to any destination you need, like an app, a database, and so on.
More details here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-azure-stream-analytics
I think you can also pull the data into Power BI, and perhaps set up some kind of automatic export from there.
More details here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-get-started-visualize-with-power-bi

Can Cosmos DB read data from File Blob or Csv or Json file at a batch size?

I am currently researching reading data with Cosmos DB. Basically, our current approach is a .NET Core C# application that uses the Cosmos DB SDK to read all the data from a file blob or CSV or JSON file and then loop over it, pulling each item's information from Cosmos DB one by one to compare/insert/update. This somehow feels inefficient.
We're curious whether Cosmos DB can read a bunch of data (let's say a batch size of 5000 records) from a file blob or CSV or JSON file and, similar to SQL Server, do a bulk insert or merge statement within Cosmos DB directly. Basically, the goal is to avoid doing the same operation one by one for each item interacting with Cosmos DB.
I've noticed and researched BulkExecutor as well; BulkUpdate looks like a more straightforward way of directly updating an item without considering whether it should be updated. In my case, for example, if I have 1000 items and only 300 items' properties got changed, I just need to update those 300 items without touching the irrelevant remaining 700 items. Basically, I need to find a way to have Cosmos DB do the data comparison over a collection, not inside a loop focused on each single item; it could either perform the update or output a collection that I can use for updating later.
Would the (.NET + SDK) application be able to perform that, or could a Cosmos DB stored procedure handle a similar job? Any other Azure tool is welcome as well!
What you are looking for is the Cosmos DB Bulk Executor library.
It is designed to operate on millions of records in bulk and it is very efficient.
You can find the .NET documentation here.
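For illustration, a minimal sketch of calling the Bulk Executor from a .NET app might look like the following; the account, database, and collection names are placeholders, and changedDocuments stands for the subset of items (e.g. your ~300 changed ones) that your comparison step produced.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.CosmosDB.BulkExecutor;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

class BulkLoader
{
    static async Task RunAsync(IEnumerable<object> changedDocuments)
    {
        var client = new DocumentClient(
            new Uri("https://<account>.documents.azure.com:443/"), "<key>");

        DocumentCollection collection = await client.ReadDocumentCollectionAsync(
            UriFactory.CreateDocumentCollectionUri("<database>", "<collection>"));

        var bulkExecutor = new BulkExecutor(client, collection);
        await bulkExecutor.InitializeAsync();

        // One bulk call for the whole batch; enableUpsert makes it insert-or-replace,
        // so only the documents you pass in (the items you detected as changed) are touched.
        var response = await bulkExecutor.BulkImportAsync(changedDocuments, enableUpsert: true);

        Console.WriteLine($"Imported {response.NumberOfDocumentsImported} documents, " +
                          $"consumed {response.TotalRequestUnitsConsumed} RUs.");
    }
}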

SCDF Metrics Collector - Include Prometheus metrics

I am using SCDF with Spring Boot 2.x metrics and the SCDF Metrics Collector to collect metrics from my Spring Boot app. I really do not understand the logic of the collector regarding the aggregateMetrics data.
When I fetch the list of metrics collected for my stream, I only have the ones starting with integration.channel.*, and thus I only have the mean value. I tried everything to make the other metrics appear, like the ones exposed by the /actuator/prometheus endpoint.
I think I have misunderstood the way the metrics are aggregated. I noticed that SCDF automatically adds some properties to metrics, and I would like to apply these properties to all the metrics I expose in order to collect them all.
{
  "_embedded": {
    "streamMetricsList": [
      {
        "name": "poc-stream",
        "applications": [
          {
            "name": "poc-message-sink",
            "instances": [
              {
                "guid": "poc-stream-poc-message-sink-v7-75b8f4dcff-29fst",
                "key": "poc-stream.poc-message-sink.poc-stream-poc-message-sink-v7-75b8f4dcff-29fst",
                "properties": {
                  "spring.cloud.dataflow.stream.app.label": "poc-message-sink",
                  "spring.application.name": "poc-message-sink",
                  "spring.cloud.dataflow.stream.name": "poc-stream",
                  "spring.cloud.dataflow.stream.app.type": "sink",
                  "spring.cloud.application.guid": "poc-stream-poc-message-sink-v7-75b8f4dcff-29fst",
                  "spring.cloud.application.group": "poc-stream",
                  "spring.cloud.dataflow.stream.metrics.version": "2.0"
                },
                "metrics": [
                  {
                    "name": "integration.channel.input.send.mean",
                    "value": 0,
                    "timestamp": "2018-10-25T16:34:39.889Z"
                  }
                ]
              }
            ],
            "aggregateMetrics": [
              {
                "name": "integration.channel.input.send.mean",
                "value": 0,
                "timestamp": "2018-10-25T16:34:52.894Z"
              }
            ]
          },
          ...
I have some Micrometer counters whose values I want to get with the Metrics Collector. I know they are exposed correctly because I have set all the properties right, and I even went into the launched Docker container to check the endpoints.
I have read that:
When deploying applications, Data Flow sets the spring.cloud.stream.metrics.properties property, as shown in the following example:
spring.cloud.stream.metrics.properties=spring.application.name,spring.application.index,spring.cloud.application.*,spring.cloud.dataflow.*
The values of these keys are used as the tags to perform aggregation. In the case of 2.x applications, these key-values map directly onto tags in the Micrometer library. The property spring.cloud.application.guid can be used to track back to the specific application instance that generated the metric.
Does that mean that I need to specifically add these properties myself to the tags of all my metrics? I know I can do that by having a MeterRegistryCustomizer bean returning the following: registry -> registry.config().commonTags(tags), with tags being the properties that SCDF normally sets itself for the integration metrics. Or does SCDF add the properties to ALL metrics?
Thanks!
While your observation about the MetricsCollector is "generally" correct, I believe there is an alternative (and perhaps cleaner) way to achieve what you've been trying to do, by using the SCDF Micrometer metrics collection approach. I will try to explain both approaches below.
As the MetricsCollector precedes the Micrometer framework in time, they implement quite different metrics processing flows. The primary goal for the Metrics Collector 2.x was to ensure backward compatibility with Spring Boot 1.x metrics. The MetricsCollector 2.x allows mixing metrics coming from both Spring Boot 1.x (pre-Micrometer) and Spring Boot 2.x (i.e. Micrometer) app starters. The consequence of this decision is that the Collector 2.x supports only the common denominator of metrics available in Boot 1.x and 2.x. This requirement is enforced by pre-filtering only the integration.channel.* metrics. At the moment you would not be able to add more metrics without modifying the Metrics Collector code. If you think that supporting different Micrometer metrics is more important than having backward compatibility with Boot 1.x, then please open a new issue in the Metrics Collector project.
Still, I believe that the approach explained below is better suited to your case!
Unlike the MetricsCollector approach, the "pure" Micrometer metrics are sent directly to the selected metrics registry (such as Prometheus, InfluxDB, Atlas and so on). As illustrated in the sample, the collected metrics can be analyzed and visualized with tools such as Grafana.
Follow the SCDF Metrics samples to set up your metrics collection via InfluxDB (or Prometheus) and Grafana. The latter would allow you to explore any out-of-the-box or custom Micrometer metrics. The downside of this approach (for the moment) is that you will not be able to visualize those metrics in the SCDF UI's pipeline. Still, if you find it important to have such visualization inside the SCDF UI, please open a new issue in the SCDF project (I have WIP for the Atlas Micrometer Registry).
I hope that this sheds some light on the alternative approaches. We would be very interested to hear your feedback.
Cheers!

cosmosdb sql api vs mongodb api - which one to use for my scenario?

I have a document called "chat"
"Chat": [
{
"User": {},
"Message": "i have a question",
"Time": "06:55 PM"
},
{
"User": {},
"Message": "will you be able to ",
"Time": "06:25 PM"
},
{
"User": {},
"Message": "ok i will do that",
"Time": "07:01 PM"
}
Every time a new chat message arrives, I should be able to simply append it to this array.
The MongoDB API aggregation pipeline (preview) allows me to use things like $push and $addToSet for that.
If I use the SQL API, I will have to pull the entire document every time, modify it, and create a new document each time.
Other considerations:
This array can grow rapidly.
This "chat" document might also be nested into other documents as well.
My question:
Does this mean that the MongoDB API is better suited for this, and the SQL API will take a performance hit in this scenario?
Does this mean that the MongoDB API is better suited for this, and the SQL API will take a performance hit in this scenario?
It's hard to say which database is the best choice.
Yes, as found in the doc, the Cosmos Mongo API supports $push and $addToSet, which is more efficient. However, in fact, the Cosmos Mongo API just supports a subset of the MongoDB features and translates requests into the Cosmos SQL equivalent, so it may have some different behaviours and results. But the onus is on the Cosmos Mongo API to improve its emulation of MongoDB.
When it comes to the Cosmos SQL API, partial update is not supported so far, though it is on the roadmap. You could submit feedback here. Currently, you need to update the entire document. Of course, you could use a stored procedure to do this job to take pressure off your client side.
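As a rough illustration of that SQL API pattern (read the whole document, modify it in memory, then replace it), here is a minimal C# sketch using the DocumentDB .NET SDK; the database, collection, and document ids are placeholders, and a partitioned collection would also need the partition key passed via RequestOptions.

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Newtonsoft.Json.Linq;

class ChatAppender
{
    static async Task AppendMessageAsync(DocumentClient client, string messageText)
    {
        Uri docUri = UriFactory.CreateDocumentUri("<database>", "<collection>", "<chat-doc-id>");

        // 1. Pull the entire document (a partitioned collection also needs
        //    RequestOptions { PartitionKey = ... } here and on the replace below).
        Document doc = await client.ReadDocumentAsync(docUri);

        // 2. Modify it in memory: append the new message to the Chat array.
        var chat = doc.GetPropertyValue<JArray>("Chat") ?? new JArray();
        chat.Add(new JObject
        {
            ["User"] = new JObject(),
            ["Message"] = messageText,
            ["Time"] = DateTime.UtcNow.ToString("hh:mm tt")
        });
        doc.SetPropertyValue("Chat", chat);

        // 3. Replace the whole document, using the ETag so a concurrent append is not lost.
        await client.ReplaceDocumentAsync(docUri, doc, new RequestOptions
        {
            AccessCondition = new AccessCondition
            {
                Type = AccessConditionType.IfMatch,
                Condition = doc.ETag
            }
        });
    }
}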
The next thing I want to say, which is the most important, is the limitation mentioned by @David: the document size limit is 2 MB in the SQL API and 4 MB in the Mongo API (What is the size limit of a cosmosdb item?). Since your chat data keeps growing, you need to consider splitting it up, and then give the documents a partition key such as "type": "chatdata" to classify them.

cosmosdb - archive data older than n years into cold storage

I researched in several places and could not find any direction on what options there are to archive old data from Cosmos DB into cold storage. I see that for DynamoDB in AWS it is mentioned that you can move DynamoDB data into S3, but I am not sure what the options are for Cosmos DB. I understand there is a time-to-live option where the data will be deleted after a certain date, but I am interested in archiving rather than deleting. Any direction would be greatly appreciated. Thanks
I don't think there is a single-click built-in feature in CosmosDB to achieve that.
Still, as you mentioned you'd appreciate any direction, I suggest you consider the DocumentDB Data Migration Tool.
Notes about the Data Migration Tool:
You can specify a query to extract only the cold data (for example, by a creation date stored within the documents).
It supports exporting to various targets (JSON file, blob storage, DB, another Cosmos DB collection, etc.).
It compacts the data in the process: it can merge documents into a single array document and zip it.
Once you have the configuration set up, you can script this to be triggered automatically using your favorite scheduling tool.
You can easily reverse the source and target to restore the cold data to the active store (or to dev, test, backup, etc.).
To remove the exported data you could use the mentioned TTL feature, but that could cause data loss should your export step fail. I would suggest writing and executing a stored procedure to query and delete all exported documents with a single call. That SP would not execute automatically, but it could be included in the automation script and executed only if the data was exported successfully first.
See: Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs.
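As a small, hedged illustration of wiring that into the automation script, the C# step below only invokes such a cleanup stored procedure after a successful export; the stored procedure id (bulkDeleteExported), its cutoff parameter, and all names are hypothetical.

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

class ArchiveCleanup
{
    static async Task DeleteExportedAsync(DocumentClient client, bool exportSucceeded, long exportedBeforeTs)
    {
        // Never delete anything if the export step failed.
        if (!exportSucceeded) return;

        Uri sprocUri = UriFactory.CreateStoredProcedureUri("<database>", "<collection>", "bulkDeleteExported");

        // The stored procedure itself (not shown) would query and delete the already
        // exported documents server-side; it runs scoped to one partition key value.
        var response = await client.ExecuteStoredProcedureAsync<int>(
            sprocUri,
            new RequestOptions { PartitionKey = new PartitionKey("<partition-key-value>") },
            exportedBeforeTs);

        Console.WriteLine($"Deleted {response.Response} exported documents.");
    }
}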
UPDATE:
These days Cosmos DB has added the Change feed, which really simplifies writing a carbon copy somewhere else.
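For illustration, a minimal sketch of reading the change feed with the DocumentDB .NET SDK and handing each changed document to an archiver; all names are placeholders and the cold-storage write is a hypothetical helper. In practice the Change Feed Processor library or an Azure Functions Cosmos DB trigger is the more common way to consume it.

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

class ChangeFeedArchiver
{
    static async Task ArchiveChangesAsync(DocumentClient client)
    {
        string collectionLink =
            UriFactory.CreateDocumentCollectionUri("<database>", "<collection>").ToString();

        // Enumerate the physical partitions of the collection.
        FeedResponse<PartitionKeyRange> ranges =
            await client.ReadPartitionKeyRangeFeedAsync(collectionLink);

        foreach (PartitionKeyRange range in ranges)
        {
            var query = client.CreateDocumentChangeFeedQuery(
                collectionLink,
                new ChangeFeedOptions
                {
                    PartitionKeyRangeId = range.Id,
                    StartFromBeginning = true,
                    MaxItemCount = 100
                });

            // Drain the change feed for this partition range and archive each document.
            while (query.HasMoreResults)
            {
                FeedResponse<Document> changes = await query.ExecuteNextAsync<Document>();
                foreach (Document doc in changes)
                {
                    await ArchiveDocumentAsync(doc);
                }
            }
        }
    }

    // Hypothetical helper: write the document to blob or other cold storage.
    static Task ArchiveDocumentAsync(Document doc) => Task.CompletedTask;
}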
