Debezium MySQL schema_only snapshot mode semantics - mariadb

We are using MariaDB with Debezium and dealing with upgrades. The process is to upgrade one DB host and then the next, etc. We're trying to minimize downtime and to avoid a snapshot of the data in the DB because it's quite large.
We could accept missing events/inconsistent snapshot of the data while we point the Debezium connector from the old DB to the new (upgraded) DB.
I'm seeking clarification of the language in the Debezium MySQL connector documentation, specifically this bit:
"schema_only - the connector runs a snapshot of the schemas and not the data. This setting is useful when you do not need the topics to contain a consistent snapshot of the data but need them to have only the changes since the connector was started."
Does this mean that the connector will start, read the schema and then start producing data change events as they subsequently occur?
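For concreteness, here is a minimal sketch of where the setting in question would sit in a connector registration payload. The connector name and host name are hypothetical, and the configuration is deliberately incomplete: a real deployment also needs credentials, a server ID, schema history settings and so on, which are omitted here.

```python
import json


def build_connector_config(db_host):
    """Build a partial Kafka Connect registration payload (illustrative only)."""
    return {
        "name": "mariadb-connector",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": db_host,  # the upgraded DB host to point at
            "database.port": "3306",
            # The mode quoted above: snapshot the schemas, not the data.
            "snapshot.mode": "schema_only",
        },
    }


# "new-db.example.internal" is a placeholder for the upgraded host.
payload = json.dumps(build_connector_config("new-db.example.internal"), indent=2)
print(payload)
```

Repointing the connector would then amount to changing `database.hostname`; whether the stored offsets remain valid against the new host is a separate question this sketch does not address.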

Related

Airflow Metadata DB = airflow_db?

I have a project requirement to back-up Airflow Metadata DB to some data warehouse (but not using an Airflow DAG). At the same time, the requirement mentions some connection called airflow_db.
I am quite new to Airflow, so I googled a bit on the topic. I am a bit confused about this part. Our Airflow Metadata DB is PostgreSQL (this is built from docker-compose, so I am tinkering on a local install), but when I look at Connections in Airflow Web UI, it says airflow_db is MySQL.
I initially assumed that they are the same, but by the looks of it, they aren't? Can someone explain the difference and what they are for?
Airflow creates the airflow_db Conn Id as a MySQL connection by default (see source code).
Default connections are not really useful in a production system. They are just a long list of entries that you are probably not going to use.
Airflow 1.10.10 introduced the ability to skip creating the default list by setting:
load_default_connections = False in airflow.cfg (See PR)
To give more background, the connection list is where hooks find the information they need to connect to a service. It is not related to the backend database. That said, the backend is a database like any other, and if you want hooks to interact with it you can define it in the list like any other connection (which is probably why it is included as an option in the defaults).
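As a small illustration of the setting mentioned above, the sketch below parses an airflow.cfg fragment and reads the flag back. It assumes the option lives in the [core] section, which may differ between Airflow versions; the helper name is made up for this example and is not part of Airflow.

```python
import configparser

# Fragment of airflow.cfg; the [core] placement is an assumption.
AIRFLOW_CFG_SNIPPET = """
[core]
load_default_connections = False
"""


def default_connections_enabled(cfg_text):
    """Return True if the default connection list would be created."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Fall back to True when the option is absent from the file.
    return parser.getboolean("core", "load_default_connections", fallback=True)


print(default_connections_enabled(AIRFLOW_CFG_SNIPPET))
```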

Azure Synapse replicated to Cosmos DB?

We have an Azure data warehouse db2 (Azure Synapse) that will need to be consumed by read-only users around the world, and we would like to replicate the needed objects from the data warehouse, potentially to a Cosmos DB. Is this possible, and if so, what are the available options (transactional, merge, etc.)?
Synapse is mainly about getting your data in one place for analysis. I don't think it has a direct export option of the kind you describe.
However, what you can do is use Azure Stream Analytics, which should let you integrate or stream the data to whatever destination you need, such as an app or a database.
more details here - https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-azure-stream-analytics
I think you can also pull the data into Power BI and perhaps set up some kind of automatic export from there.
more details here - https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-get-started-visualize-with-power-bi

Kafka Connector for Oracle Database Source

I want to build a Kafka Connector in order to retrieve records from a database at near real time. My database is the Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 and the tables have millions of records. First of all, I would like to add the minimum load to my database using CDC. Secondly, I would like to retrieve records based on a LastUpdate field which has value after a certain date.
Searching the Confluent site, the only open source connector I found was “Kafka Connect JDBC”. I think this connector doesn’t have a CDC mechanism and that it isn’t possible to retrieve millions of records when the connector starts for the first time. The alternative I considered is Debezium, but there is no Debezium Oracle connector on the Confluent site, and I believe it is still in beta.
Which solution would you suggest? Is something wrong with my assumptions about Kafka Connect JDBC or the Debezium connector? Is there any other solution?
For query-based CDC, which is less efficient, you can use the JDBC source connector.
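To make the query-based approach concrete, here is a minimal sketch of polling on a last-update column. It uses an in-memory SQLite table as a stand-in for the Oracle source; the table and column names are hypothetical, and this is not how the JDBC source connector is implemented internally, only the general idea.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Fetch rows modified after last_seen and return them with the new cursor."""
    rows = conn.execute(
        "SELECT id, payload, last_update FROM records "
        "WHERE last_update > ? ORDER BY last_update",
        (last_seen,),
    ).fetchall()
    # Advance the cursor to the newest timestamp seen, if any rows came back.
    new_last_seen = rows[-1][2] if rows else last_seen
    return rows, new_last_seen


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT, last_update TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "a", "2020-01-01T00:00:00"),
        (2, "b", "2020-01-02T00:00:00"),
    ],
)

# Only rows strictly newer than the starting timestamp are returned.
rows, cursor = poll_changes(conn, "2020-01-01T00:00:00")
print(rows, cursor)
```

Every poll runs a query against the source table, and deleted rows never show up in the result, which is part of why this approach is the less efficient one.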
For log-based CDC I am aware of a few options; however, some of them require a license:
1) Attunity Replicate, which lets users build real-time data pipelines from producer systems into Apache Kafka through a graphical interface, without any manual coding or scripting. I have been using Attunity Replicate for Oracle -> Kafka for a couple of years and have been very satisfied.
2) Oracle GoldenGate, which requires a license.
3) Oracle LogMiner, which does not require a license and is used by both Attunity and kafka-connect-oracle, a Kafka source connector that captures all row-based DML changes from an Oracle database and streams them to Kafka. Its change data capture logic is based on Oracle LogMiner.
We have numerous customers using IBM's IIDR (InfoSphere Data Replication) product to replicate data from Oracle databases (as well as Z mainframe, I-series, SQL Server, etc.) into Kafka.
Regardless of which source is used, data can be normalized into one of many formats in Kafka. An example of an included, selectable format is...
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavrosinglerow.html
The solution is highly scalable and has been measured replicating changes at rates in the hundreds of thousands of rows per second.
We also have a proprietary ability to reconstitute data written in parallel to Kafka back into its original source order. So even though the data has been written to numerous partitions and topics, the original total order can be recovered. This functionality is known as the TCC (transactionally consistent consumer).
See the video and slides here...
https://kafka-summit.org/sessions/exactly-once-replication-database-kafka-cloud/

Re-run all changes in Lease Collection

I created several new pipelines in Azure Data Factory to process CosmosDB Change Feed (which go into Blob storage for ADF processing to on-prem SQL Server), and I'd like to "resnap" the data from the leases collection to force a full re-sync. Is there a way to do this?
For clarity, my set-up is:
Change Feed -> Azure Function to process the changes -> Blob Storage to hold the JSON documents -> Azure Data Factory, which picks up the Blob Storage documents and maps them to on-prem SQL Server stored proc inserts/updates.
The simplest way to do it is to delete the lease documents and make sure that the StartFromBeginning setting is set to true. Once restarted, the change feed service will recreate the leases (if the appropriate setting is configured to true) and reprocess all the documents.
The other way is to update every single lease document and reset the Continuation token "checkpoint" to null. However, I don't recommend this method, since you might accidentally miss a lease, which can lead to issues.

wso2 "complete and same" master-datasources.xml on all five WSO2 API-M components

I'm setting up WSO2 APIM HA in a distributed environment and I have some challenges using this documentation.
Documentation states: Note: When configuring clustering, ignore the WSO2_CARBON_DB data source configuration.
The question is, can I really not use the CARBON DB instead of the UM and REG databases in HA?
Documentation mentions to configure following:
AM DB - in the Publisher, Store, and Key Manager nodes
UM DB - in the Publisher, Store, and Key Manager nodes
REG DB - in the API Publisher and Store nodes. (single tenant)
MB DB - in the Traffic manager nodes (each TM own DB)
The question is, can I completely fill in one master-datasources.xml file and copy it over to all components so that I don't have to edit it on each server (only editing the second TM datasource to point to the second MB DB)?
Yes, it is fine to completely fill in a single master-datasources.xml file and copy it over to all other components (except for WSO2_MB_STORE_DB, which is the MB DB).
The MB DB (WSO2_MB_STORE_DB) has to be separate for each node, because this DB is used for traffic as well as internally by throttling policies, which generate a very high rate of DB transactions.
It will work if you don't keep WSO2_MB_STORE_DB separate, but the large number of DB transactions can slow down your single DB. So it is highly advisable to maintain a separate DB on each node. It will also make debugging easier in production environments.
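One way to keep a single shared file while still giving each Traffic Manager its own MB DB is to render the file per node from a template. The sketch below assumes a heavily trimmed version of the master-datasources.xml layout; the WSO2AM_DB entry, host names and JDBC URLs are hypothetical placeholders, and a real file contains more elements per datasource.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative template; not a complete master-datasources.xml.
TEMPLATE = """<datasources-configuration>
  <datasources>
    <datasource>
      <name>WSO2AM_DB</name>
      <definition type="RDBMS"><configuration>
        <url>jdbc:mysql://db.example.internal:3306/am_db</url>
      </configuration></definition>
    </datasource>
    <datasource>
      <name>WSO2_MB_STORE_DB</name>
      <definition type="RDBMS"><configuration>
        <url>PLACEHOLDER</url>
      </configuration></definition>
    </datasource>
  </datasources>
</datasources-configuration>"""


def render_for_node(template_xml, mb_db_url):
    """Return the shared config with only the MB store URL replaced."""
    root = ET.fromstring(template_xml)
    for ds in root.iter("datasource"):
        if ds.findtext("name") == "WSO2_MB_STORE_DB":
            ds.find("definition/configuration/url").text = mb_db_url
    return ET.tostring(root, encoding="unicode")


# Each Traffic Manager node gets its own MB DB; everything else is identical.
tm1_xml = render_for_node(TEMPLATE, "jdbc:mysql://tm1-db.example.internal:3306/mb_db")
tm2_xml = render_for_node(TEMPLATE, "jdbc:mysql://tm2-db.example.internal:3306/mb_db")
print(tm1_xml)
```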
