Azure Data Explorer: Data Load Performance

We are loading a file of 100+ GB into Azure Data Explorer using ADF's copy activity, and it takes more than 13 hours.
Is there any best practice for making the load complete in a reasonable time frame?

What is the source of the data? If it is Azure Storage and there are fewer than 5,000 blobs/files, use one-click ingestion to ingest the data from storage: https://learn.microsoft.com/en-us/azure/data-explorer/one-click-ingestion-new-table
If there are more than 5,000 blobs/files, use LightIngest to ingest the historical data into ADX: https://learn.microsoft.com/en-us/azure/data-explorer/generate-lightingest-command
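If you prefer to script the ingestion yourself, the same queued-ingestion path that LightIngest uses is also exposed through the Kusto SDKs. Below is a minimal sketch with the Python azure-kusto-ingest package; the cluster URI, database, table, and blob URL are placeholders you would replace with your own values.

```python
# Rough sketch of queued ingestion with the Azure Data Explorer Python SDK
# (pip install azure-kusto-data azure-kusto-ingest, recent versions).
# Cluster, database, table, and blob names below are placeholders.
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.data.data_format import DataFormat
from azure.kusto.ingest import QueuedIngestClient, IngestionProperties, BlobDescriptor

# Queued ingestion goes through the cluster's "ingest-" endpoint.
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    "https://ingest-mycluster.westeurope.kusto.windows.net"  # placeholder cluster URI
)
client = QueuedIngestClient(kcsb)

props = IngestionProperties(
    database="MyDatabase",        # placeholder database
    table="MyTable",              # placeholder table
    data_format=DataFormat.CSV,   # match the format of your source files
)

# One queued ingestion message per blob; ADX pulls and ingests them in parallel.
blob = BlobDescriptor(
    "https://myaccount.blob.core.windows.net/staging/part-0001.csv?<SAS>",  # placeholder blob + SAS
    size=1_000_000_000,  # approximate size helps ADX schedule the ingestion
)
client.ingest_from_blob(blob, ingestion_properties=props)
```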

You can break this into two activities:
Use ADF to break the file into chunks of roughly 1 GB each, writing them into a container on Azure Storage (a rough sketch of this step is shown after this list).
Using another activity, or the methods in the answer above, ingest those files into Azure Data Explorer.
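For the splitting step, ADF's copy activity can write the parts for you, but as a rough illustration of the idea, here is a minimal Python sketch that cuts a large local CSV into ~1 GB parts on row boundaries and uploads them to a staging container with azure-storage-blob. The file, container, and connection-string names are placeholders.

```python
# Rough sketch: split a large local CSV into ~1 GB parts and upload each part
# to a staging container (pip install azure-storage-blob). All names are placeholders.
# Note: if the file has a header row, only the first part will contain it.
import os
from azure.storage.blob import ContainerClient

SOURCE = "huge_export.csv"                    # placeholder local file
TARGET_PART_BYTES = 1_000_000_000             # ~1 GB per part, as suggested above

container = ContainerClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"],  # placeholder connection string
    container_name="staging",                 # placeholder container
)

def flush_part(path: str) -> None:
    """Upload one finished part to the staging container."""
    with open(path, "rb") as data:
        container.upload_blob(name=path, data=data, overwrite=True)

part_no, written, out, part_name = 0, 0, None, ""
with open(SOURCE, "rb") as src:
    for line in src:                          # split on line boundaries so rows stay intact
        if out is None or written >= TARGET_PART_BYTES:
            if out is not None:
                out.close()
                flush_part(part_name)
            part_no += 1
            part_name = f"part-{part_no:04d}.csv"
            out, written = open(part_name, "wb"), 0
        out.write(line)
        written += len(line)
if out is not None:
    out.close()
    flush_part(part_name)
```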
You are also welcome to open a support ticket as the current behavior looks wrong.

In addition to the other answers, check the CPU consumption of your cluster during ingestion. Enabling autoscale (scale out) may reduce the loading time in some cases.

Related

Azure Synapse replicated to Cosmos DB?

We have an Azure data warehouse database (Azure Synapse) that will need to be consumed by read-only users around the world, and we would like to replicate the needed objects from the data warehouse, potentially to a Cosmos DB. Is this possible, and if so, what are the available options (transactional, merge, etc.)?
Synapse is mainly about getting your data in to do analysis; I don't think it has a direct export option of the kind you have described above.
However, what you can do is use Azure Stream Analytics, and then you should be able to integrate/stream whatever you want to any destination you need, like an app or a database and so on.
More details here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-azure-stream-analytics
I think you can also pull the data into Power BI and perhaps set up some kind of automatic export from there.
More details here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-get-started-visualize-with-power-bi
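As a crude alternative to the Stream Analytics and Power BI routes above, you could also script a periodic batch copy yourself. The sketch below is just an illustration, not an official pattern: it reads rows from the Synapse dedicated SQL pool with pyodbc and upserts them into a Cosmos DB container with the azure-cosmos SDK. Every connection string, table, and container name is a placeholder.

```python
# Rough sketch of a periodic batch copy from a Synapse (SQL DW) table into a
# Cosmos DB container (pip install pyodbc azure-cosmos). All names/URIs are placeholders.
import os
import pyodbc
from azure.cosmos import CosmosClient

# Read the rows to replicate from the Synapse dedicated SQL pool.
sql_conn = pyodbc.connect(os.environ["SYNAPSE_ODBC_CONNECTION_STRING"])  # placeholder
cursor = sql_conn.cursor()
cursor.execute("SELECT CustomerId, Name, Country, UpdatedAt FROM dbo.Customers")  # placeholder table
columns = [c[0] for c in cursor.description]

# Upsert them into a Cosmos DB container keyed by CustomerId.
cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=os.environ["COSMOS_KEY"])  # placeholders
container = cosmos.get_database_client("analytics").get_container_client("customers")  # placeholders

for row in cursor:
    doc = dict(zip(columns, row))
    doc["id"] = str(doc["CustomerId"])               # Cosmos DB requires a string "id" field
    doc["UpdatedAt"] = doc["UpdatedAt"].isoformat()  # JSON-serialize datetimes
    container.upsert_item(doc)
```

A script like this could be scheduled (for example from ADF or a timer-triggered function) if the read-only consumers can tolerate batch-level freshness.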

Kafka Connector for Oracle Database Source

I want to build a Kafka connector in order to retrieve records from a database in near real time. My database is Oracle Database 11g Enterprise Edition Release 11.2.0.3.0, and the tables have millions of records. First of all, I would like to put the minimum load on my database by using CDC. Secondly, I would like to retrieve records based on a LastUpdate field whose value is after a certain date.
Searching the Confluent site, the only open-source connector I found was the Kafka Connect JDBC connector. I think this connector doesn't have a CDC mechanism, and it isn't possible to retrieve millions of records when the connector starts for the first time. The alternative solution I considered is Debezium, but there is no Debezium Oracle connector on the Confluent site, and I believe it is still in beta.
Which solution would you suggest? Is something wrong with my assumptions about Kafka Connect JDBC or the Debezium connector? Is there any other solution?
For query-based CDC, which is less efficient, you can use the JDBC source connector, for example:
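To make the query-based option concrete, here is a rough sketch of registering a timestamp-based JDBC source connector through the Kafka Connect REST API. The Oracle host, table, and key column are placeholders; the timestamp column is assumed to be the LastUpdate field from the question.

```python
# Rough sketch: register a query-based CDC connector (Confluent JDBC source) via
# the Kafka Connect REST API (pip install requests). Host, table, and column names
# are placeholders; "timestamp+incrementing" polls rows by LASTUPDATE plus a key column.
import requests

connector = {
    "name": "oracle-orders-source",  # placeholder connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL",  # placeholder
        "connection.user": "kafka_reader",       # placeholder
        "connection.password": "********",
        "table.whitelist": "ORDERS",             # placeholder table
        "mode": "timestamp+incrementing",        # query-based CDC: no redo-log access
        "timestamp.column.name": "LASTUPDATE",   # the LastUpdate field from the question
        "incrementing.column.name": "ORDER_ID",  # placeholder primary key column
        "topic.prefix": "oracle-",               # topics become e.g. oracle-ORDERS
        "poll.interval.ms": "5000",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
```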
For log-based CDC, I am aware of a couple of options; however, some of them require a license:
1) Attunity Replicate, which allows users to use a graphical interface to create real-time data pipelines from producer systems into Apache Kafka without having to do any manual coding or scripting. I have been using Attunity Replicate for Oracle -> Kafka for a couple of years and was very satisfied.
2) Oracle GoldenGate, which requires a license.
3) Oracle LogMiner, which does not require any license and is used by both Attunity and kafka-connect-oracle, a Kafka source connector for capturing all row-based DML changes from an Oracle database and streaming these changes to Kafka. Its change data capture logic is based on the Oracle LogMiner solution.
We have numerous customers using IBM's IIDR (InfoSphere Data Replication) product to replicate data from Oracle databases (as well as z mainframe, IBM i, SQL Server, etc.) into Kafka.
Regardless of which source is used, data can be normalized into one of many formats in Kafka. An example of an included, selectable format is:
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavrosinglerow.html
The solution is highly scalable and has been measured replicating changes at rates in the hundreds of thousands of rows per second.
We also have a proprietary ability to reconstitute data written in parallel to Kafka back into its original source order. So, despite data having been written to numerous partitions and topics, the original total order can be known. This functionality is known as the transactionally consistent consumer (TCC).
See the video and slides here...
https://kafka-summit.org/sessions/exactly-once-replication-database-kafka-cloud/

cosmosdb - archive data older than n years into cold storage

I have researched in several places and could not find any direction on what options there are to archive old data from Cosmos DB into cold storage. I see that for DynamoDB in AWS it is mentioned that you can move DynamoDB data into S3, but I am not sure what the options are for Cosmos DB. I understand there is a time-to-live option where the data will be deleted after a certain date, but I am interested in archiving rather than deleting. Any direction would be greatly appreciated. Thanks.
I don't think there is a single-click built-in feature in Cosmos DB to achieve that.
Still, as you mentioned appreciating any direction, I suggest you consider the DocumentDB Data Migration Tool.
Notes about the Data Migration Tool:
You can specify a query to extract only the cold data (for example, by a creation date stored within documents).
It supports exporting to various targets (JSON file, blob storage, a database, another Cosmos DB collection, etc.).
It compacts the data in the process: it can merge documents into a single array document and zip it.
Once you have the configuration set up, you can script this to be triggered automatically using your favorite scheduling tool.
You can easily reverse the source and target to restore the cold data to the active store (or to dev, test, backup, etc.).
To remove the exported data you could use the TTL feature mentioned above, but that could cause data loss should your export step fail. I would suggest writing and executing a stored procedure to query and delete all exported documents with a single call. That stored procedure would not execute automatically, but it could be included in the automation script and executed only if the data was exported successfully first.
See: Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs.
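If you would rather script the whole archive-then-delete flow instead of using the Data Migration Tool, a rough sketch with the azure-cosmos and azure-storage-blob Python SDKs could look like the following. The database, container, partition-key, and cutoff values are placeholders, and the delete happens only after the export has succeeded, in line with the caution above.

```python
# Rough sketch of an archive-then-delete job for Cosmos DB
# (pip install azure-cosmos azure-storage-blob). Names and the cutoff are placeholders.
import json
import os
import time
from azure.cosmos import CosmosClient
from azure.storage.blob import BlobClient

CUTOFF = int(time.time()) - 2 * 365 * 24 * 3600   # archive documents older than ~2 years

cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=os.environ["COSMOS_KEY"])  # placeholders
container = cosmos.get_database_client("appdb").get_container_client("events")        # placeholders

# 1. Query the cold documents (Cosmos DB keeps the last-modified epoch in _ts).
cold_docs = list(container.query_items(
    query="SELECT * FROM c WHERE c._ts < @cutoff",
    parameters=[{"name": "@cutoff", "value": CUTOFF}],
    enable_cross_partition_query=True,
))

# 2. Write them to cheap blob storage as a single JSON file.
blob = BlobClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"],            # placeholder
    container_name="cosmos-archive",
    blob_name=f"events-{CUTOFF}.json",
)
blob.upload_blob(json.dumps(cold_docs), overwrite=True)

# 3. Only after the export has succeeded, delete the archived documents.
for doc in cold_docs:
    container.delete_item(doc["id"], partition_key=doc["pk"])  # "pk" is a placeholder partition key
```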
UPDATE:
These days Cosmos DB has added the change feed, which really simplifies writing a carbon copy somewhere else.

What is the best way to get (stream) data from BigQuery to R (RStudio Server in Docker)

I have a number of large tables in Google BigQuery, containing data to be processed in R. I am running RStudio via Docker on Google Cloud Platform using the Container Engine.
I have tested a few routes with a table of 38 million rows (three columns) with a table size of 862 MB in BigQuery.
The first route I tested was using the R package bigrquery. This option was preferred, as data can be queried directly from BigQuery and data acquisition can be incorporated into R loops. Unfortunately, this option is very slow; it takes close to an hour to complete.
The second option I tried was exporting the BigQuery table to a CSV file on Google Cloud Storage (approx. 1 minute) and using the public link to import it into RStudio (another 5 minutes). This route entails quite a bit of manual handling, which is not desirable.
In Google Cloud Console I noticed VM instances can be granted access to BigQuery. Also, RStudio can be configured to have root access in its Docker container.
So finally my question: Is there a way to use this backdoor to enable fast data-transfer from BigQuery into an R dataframe in an automated way? Or are there other ways to achieve this goal?
Any help is highly appreciated!
Edit:
I have loaded the same table into a MySQL database hosted in Google Cloud SQL, this time it took only about 20 seconds to load the same amount of data. So some kind of translation from BigQuery to SQL is an option too.
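For what it is worth, the export-to-Cloud-Storage route described above can be automated so it no longer needs manual handling. A rough sketch with the google-cloud-bigquery Python client is shown below (the same extract job can be started from the bq CLI or from R); the project, dataset, table, and bucket names are placeholders.

```python
# Rough sketch: automate the "export to CSV on Cloud Storage" route with the
# BigQuery client library (pip install google-cloud-bigquery). Project, dataset,
# table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")             # placeholder project

table_ref = "my-project.my_dataset.big_table"              # placeholder table
destination = "gs://my-bucket/exports/big_table-*.csv"     # wildcard: BigQuery shards large exports

# Kick off the export job and wait for it to finish (typically on the order of a minute).
extract_job = client.extract_table(table_ref, destination)
extract_job.result()
print("Exported", table_ref, "to", destination)
```

The resulting CSV shards can then be pulled into the R session, which keeps the data acquisition scriptable end to end.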

Plugging another relational DB to OpenDS

Currently I'm working on a project with OpenDS. I have to load more than 200k entries into OpenDS, but unfortunately it fails at random times once the load exceeds roughly 10k-15k entries.
When I googled that particular error (alert ID 9896233: JE Database Environment corresponding to backend id userRoot is corrupt. Restart the Directory Server to reopen the Environment), it seemed like the OpenDS backend DB [Berkeley DB] is not that reliable when adding a massive number of entries. How can I plug a reliable commercial or open-source relational DB [Oracle/H2] into OpenDS? Is there any configuration for this, or do I have to change the OpenDS code?
First, you should be aware that Oracle has pulled the plug on the OpenDS project and it is now completely stalled. Development continues as open source in the OpenDJ project: http://opendj.forgerock.org.
That said, I believe there is a problem with your environment. When I was still working on OpenDS, our basic stress test was importing and running a very high load against 10 million users. 200K entries is not a massive number. My daily OpenDJ tests on my laptop are done with 100K to 1M entries. We have customers running OpenDJ in production with more than 20M entries, growing 40% every 6 months!
Berkeley DB has proven to be very scalable and reliable.
Things you might want to check: what is the maximum number of files that can be opened by a single process on your machine? Linux defaults to 1024, and that limit can be easy to hit with OpenDS or OpenDJ (a quick way to check it is sketched below). Are you using a local filesystem? Berkeley DB is not supported on networked filesystems such as NFS or other NAS.
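As a quick diagnostic, the per-process open-file limit (the value `ulimit -n` reports on Linux) can be read from a small script. The threshold used in the sketch is only an illustrative guess, not an official recommendation.

```python
# Quick check of the per-process open-file limit (what "ulimit -n" shows on Linux).
# If the soft limit is still the Linux default of 1024, raise it (up to the hard limit)
# in /etc/security/limits.conf or the service definition before re-running the import.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
if soft <= 1024:
    print("soft limit looks low for a JE (Berkeley DB Java Edition) backend")
```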
Finally, check the logs/errors file and your system logs. Chances are that one of them will contain a message with the root cause of the problem (most likely logs/errors).
Kind regards,
Ludovic Poitou
ForgeRock - Product Manager for OpenDJ
