Load Oracle 11g data to HDFS using Flume

Is there a way to use flume to transmit my Oracle 11g Database data to HDFS?
I know Flume is made for logs and that Sqoop should be used to transfer data from a database. But is there a way to use Flume instead of Sqoop? What should I do if I want to use this kind of architecture?

Please have a look into:
1) Oracle GoldenGate
2) Streaming Oracle Database Logs to HDFS with Flume

Another way to do it: for the data already present in Oracle you can run Sqoop, and for the subsequent changes you can use LinkedIn Databus for change data capture (CDC), which can post messages to Kafka.
Messages from Kafka can then be easily consumed by Flume.
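For the Kafka-to-HDFS leg, a minimal Flume agent configuration could look like the sketch below. This assumes Flume 1.7+ (which ships a Kafka source); the broker address, topic name and HDFS path are placeholders.

```
# Hypothetical agent "agent1": Kafka source -> memory channel -> HDFS sink
agent1.sources  = kafka-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-sink

# Kafka source: reads the CDC messages published to the (placeholder) topic oracle-cdc
agent1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafka-src.kafka.bootstrap.servers = kafkahost:9092
agent1.sources.kafka-src.kafka.topics = oracle-cdc
agent1.sources.kafka-src.channels = mem-ch

# In-memory channel (fast, but not durable)
agent1.channels.mem-ch.type = memory
agent1.channels.mem-ch.capacity = 10000

# HDFS sink: writes events into a date-partitioned directory
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/data/oracle-cdc/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink.channel = mem-ch
```

You would start the agent with something like flume-ng agent -n agent1 -f <config-file>.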

Related

How to work with kafka-sink-azure-kusto and debug it

I am ingesting data into a Kafka cluster that should send the data on to an ADX (Kusto) database using kafka-sink-azure-kusto.
I am successfully ingesting data into the Kafka cluster, but it is not transferring the data to the Kusto database. How can I debug this? Are there any logs I can check?
I have tried checking the broker log; there are no errors there.
Ref: https://github.com/Azure/kafka-sink-azure-kusto/blob/master/README.md
Could you please provide more information about how you are running Kafka and how you set up the connector?
Debugging steps would be (see the REST API checks below):
Broker logs should mention that the connector was picked up properly; did you see that line in the logs?
The connector logs should show more about what is actually going on under the hood. Maybe you will see some errors there: /var/log/connect-distributed.log
Try to ingest data via another method, such as one of the SDKs.
Try running the setup according to the steps detailed under "deploy" in the README.
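Beyond the log files, the Kafka Connect REST API is usually the quickest way to see whether the connector and its tasks are actually running. For example, assuming the default REST port 8083 and a connector registered under the placeholder name kusto-sink:

```
# List the connectors known to the Connect cluster
curl -s http://localhost:8083/connectors

# Show connector and task state ("kusto-sink" is a placeholder for your connector name)
curl -s http://localhost:8083/connectors/kusto-sink/status

# Show the configuration the worker actually loaded
curl -s http://localhost:8083/connectors/kusto-sink/config
```

If the status response reports a FAILED task, it normally includes a stack trace that points at the underlying problem (authentication, topic-to-table mapping, and so on).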
Update: more info about connector setup in general can be found in this SO question: Kafka connect cluster setup or launching connect workers.
Confluent also has some helpful docs: https://docs.confluent.io/current/connect/userguide.html

Streaming data from Oracle 11g to Kafka

I am looking for a solution to stream data from Oracle 11g to Kafka. I was hoping to use GoldenGate, but that only seems to be available for Oracle 12c. Is the Confluent platform the best way to go?
Thanks!
First, the general answer would be: The best way to connect Oracle (databases) to Kafka is indeed to use Confluent Platform with Kafka's Connect API in combination with a ready-to-use connector for GoldenGate. See the GoldenGate/Oracle entry in section "Certified Connectors" at https://www.confluent.io/product/connectors/. The listed Kafka connector for GoldenGate is maintained by Oracle.
Is the Confluent platform the best way to go?
Hence, in general, the answer to the above question is: "Yes, it is."
However, as you pointed out for your specific question about Oracle versions, Oracle unfortunately has the following information in the README of their GoldenGate connector:
Supported Versions
The Oracle GoldenGate Kafka Connect Handler/Formatter is coded and tested with the following product versions.
Oracle GoldenGate for Big Data 12.2.0.1.1
Confluent IO Kafka/Kafka Connect 0.9.0.1-cp1
Porting may be required for Oracle GoldenGate Kafka Connect Handler/Formatter to work with other versions of Oracle GoldenGate for Big Data and/or Confluent IO Kafka/Kafka Connect.
This means that the connector does not work with Oracle 11g, at least as far as I can tell.
Sorry if that doesn't answer your specific question. At least I wanted to give you some feedback on the general approach. If I do come across a more specific answer, I'll update this text.
Update Mar 15, 2017: The best option you have at the moment is to use Confluent's JDBC connector. That connector can't give you quite the same feature set as Oracle's native GoldenGate connector, though.
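For reference, a minimal JDBC source connector configuration for Oracle 11g could look like the sketch below; the connection URL, credentials, table and column names are placeholders, and it assumes the table has an incrementing ID plus a last-modified timestamp column.

```
# Hypothetical oracle-jdbc-source.properties
name=oracle-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:oracle:thin:@//dbhost:1521/ORCL
connection.user=kafka_reader
connection.password=secret

# Incremental load: pick up new and updated rows via an ID and a last-modified column
mode=timestamp+incrementing
incrementing.column.name=ID
timestamp.column.name=LAST_MODIFIED

table.whitelist=ORDERS
topic.prefix=oracle-
poll.interval.ms=10000
```

Keep in mind that this is polling-based rather than log-based CDC, so for example deletes are not captured.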
Oracle GoldenGate and Confluent Platform are not comparable.
Confluent Platform provides a complete streaming platform and is a collection of multiple pieces of software that can be used for streaming your data, whereas GoldenGate is replication and data-integration software.
Also, GoldenGate is highly reliable for database replication since it maintains transactional integrity; the same cannot be said for Kafka MirrorMaker or Confluent's Replicator at this time.
If you want just pure transactions, please also consider using OpenLogReplicator. It supports Oracle Database from version 11.2.0.1.
It can produce transactions to Kafka in 2 formats:
Classic format: every transaction is one Kafka message (multiple DMLs per Kafka message)
Debezium-style format: transactions are split up, so every DML is one Kafka message
There is already a working version. You can try it.
Right now I am using ojdbc6 to connect to Oracle 11g. It is good enough but not perfect, especially when using polling mode to check whether there are new updates on the source tables.
I also tried to read all tables matching a certain pattern, but this did not work well.
The best way to connect an Oracle DB to Kafka (especially when the tables are very wide, column-wise) is to use queries in the connectors. This way you ensure that you pick the right fields and can do some casting for numbers if you are using Avro.
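A sketch of that query-based approach with Confluent's JDBC source connector; the query, column names and topic prefix are placeholders:

```
# Hypothetical query-based source connector config
name=oracle-orders-query-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:oracle:thin:@//dbhost:1521/ORCL
connection.user=kafka_reader
connection.password=secret

# Pull only the columns you need instead of the full (very wide) table
query=SELECT ID, CUSTOMER_ID, AMOUNT, LAST_MODIFIED FROM ORDERS
mode=timestamp
timestamp.column.name=LAST_MODIFIED

# Map Oracle NUMBER columns to sensible Avro types instead of byte-encoded decimals
numeric.mapping=best_fit

# With "query" set, topic.prefix is used as the topic name
topic.prefix=oracle-orders
poll.interval.ms=10000
```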

Is it possible to use single table for multiple connections in cassandra

Is it possible to use single table for multiple connections in cassandra?
I am using Spring Data for the Java connection and am trying to use Python Flask for the Raspberry Pi connection.
Yes, you can use a single table for multiple connections. I'm using Cassandra 2.2, where I connect with the Java Astyanax driver and the Python pycassa client.
Are you asking about table locking for transactions? Cassandra does not use any locking mechanism on data; the docs link below describes transactions and consistency in Cassandra:
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_about_transactions_c.html
Please mention what you tried and what problems you faced.
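As a minimal illustration (using the DataStax Java driver rather than the Astyanax/pycassa pair mentioned above), two independent sessions can read and write the same table concurrently; the contact point, keyspace and table below are hypothetical:

```
// Two separate connections (e.g. the Spring Data app and the Raspberry Pi client)
// working against the same hypothetical table sensors.readings.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class TwoConnectionsOneTable {
    public static void main(String[] args) {
        try (Cluster clusterA = Cluster.builder().addContactPoint("127.0.0.1").build();
             Cluster clusterB = Cluster.builder().addContactPoint("127.0.0.1").build()) {

            Session appSession = clusterA.connect("sensors"); // stands in for the Spring Data app
            Session piSession  = clusterB.connect("sensors"); // stands in for the Raspberry Pi client

            piSession.execute(
                "INSERT INTO readings (device_id, ts, value) VALUES ('pi-01', toTimestamp(now()), 21.5)");

            ResultSet rs = appSession.execute("SELECT value FROM readings WHERE device_id = 'pi-01'");
            rs.forEach(row -> System.out.println(row.getDouble("value")));
        }
    }
}
```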

How to use ODBC to connect to any DBMS

I'm developing a Java application and I'm using JDBC to connect to a MySQL database. Now I want to use ODBC to be able to retrieve data from any DBMS, provided of course that I have access to it. Is there an API or tool to do this?
What you are looking for is a JDBC-ODBC bridge. There are several available, but this approach is not recommended; instead, you should always use a native JDBC driver for the DBMS in question.
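With the native-driver approach, only the JDBC URL (and the driver JAR on the classpath) changes per DBMS, while the code itself stays the same. A minimal sketch, with placeholder URLs and credentials:

```
// Sketch: DriverManager picks the registered driver based on the JDBC URL,
// so switching DBMS mostly means switching the URL and the driver dependency.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AnyDbmsViaJdbc {
    public static void main(String[] args) throws Exception {
        // Pick one URL depending on the target DBMS (placeholders):
        String mysqlUrl  = "jdbc:mysql://localhost:3306/mydb";
        String oracleUrl = "jdbc:oracle:thin:@//localhost:1521/ORCL";

        try (Connection conn = DriverManager.getConnection(mysqlUrl, "user", "password");
             Statement stmt = conn.createStatement();
             // "SELECT 1" works on MySQL; Oracle would need "SELECT 1 FROM DUAL"
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}
```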

How is Flume distributed?

I am working with Flume to ingest a ton of data into HDFS (on the order of petabytes). I would like to know how Flume makes use of its distributed architecture. I have over 200 servers, and I have installed Flume on one of them, which is where I would get the data from (aka the data source); the sink is HDFS. (Hadoop is running over Serengeti on these servers.) I am not sure whether Flume distributes itself over the cluster or whether I have installed it incorrectly. I followed Apache's user guide for the Flume installation and this SO post:
How to install and configure apache flume?
http://flume.apache.org/FlumeUserGuide.html#setup
I am a newbie to Flume and trying to understand more about it. Any help would be greatly appreciated. Thanks!!
I'm not going to speak to Cloudera's specific recommendations but instead to Apache Flume itself.
It's distributed however you decide to distribute it. Decide on your own topology and implement it.
You should think of Flume as a durable pipe. It has a source (you can choose from a number), a channel (you can choose from a number) and a sink (again, you can choose from a number). It is pretty typical to use an Avro sink in one agent to connect to an Avro source in another.
Assume you are installing Flume to gather Apache webserver logs. The common architecture would be to install Flume on each Apache webserver machine. You would probably use the Spooling Directory Source to get the Apache logs and the Syslog Source to get syslog. You would use the memory channel for speed and so as not to affect the server (at the cost of durability) and use the Avro sink.
That Avro sink would be connected, via Flume load balancing, to 2 or more collectors. The collectors would be Avro source, File channel and whatever you wanted (elasticsearch?, hdfs?) as your sink. You may even add another tier of agents to handle the final output.
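To make that concrete, here is a rough sketch of such a two-tier setup as two Flume configuration files; all hostnames, ports and paths are placeholders:

```
# --- web-tier agent "web" (runs on each Apache webserver) ---
web.sources  = apache-logs
web.channels = mem
web.sinks    = avro1 avro2
web.sinkgroups = lb

web.sources.apache-logs.type = spooldir
web.sources.apache-logs.spoolDir = /var/log/apache2/spool
web.sources.apache-logs.channels = mem

# Memory channel: fast, at the cost of durability
web.channels.mem.type = memory
web.channels.mem.capacity = 10000

web.sinks.avro1.type = avro
web.sinks.avro1.hostname = collector1.example.com
web.sinks.avro1.port = 4545
web.sinks.avro1.channel = mem

web.sinks.avro2.type = avro
web.sinks.avro2.hostname = collector2.example.com
web.sinks.avro2.port = 4545
web.sinks.avro2.channel = mem

# Load-balance events across the two collectors
web.sinkgroups.lb.sinks = avro1 avro2
web.sinkgroups.lb.processor.type = load_balance

# --- collector-tier agent "col" (runs on collector1/collector2) ---
col.sources  = avro-in
col.channels = file-ch
col.sinks    = hdfs-out

col.sources.avro-in.type = avro
col.sources.avro-in.bind = 0.0.0.0
col.sources.avro-in.port = 4545
col.sources.avro-in.channels = file-ch

# Durable file channel on the collector tier
col.channels.file-ch.type = file

col.sinks.hdfs-out.type = hdfs
col.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/logs/apache/%Y-%m-%d
col.sinks.hdfs-out.hdfs.fileType = DataStream
col.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
col.sinks.hdfs-out.channel = file-ch
```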
In recent versions, Apache Flume no longer follows a master-slave architecture; that design was dropped with Flume 1.x (Flume NG).
There is no longer a Master and no ZooKeeper dependency. Flume now runs with a simple file-based configuration system.
If we want it to scale, we need to install it on multiple physical nodes and run our own topology. As far as a single node is concerned:
Say we hook into a JMS server that delivers 2000 XML events per second, and I need two Flume agents to pull that data. I then have two distribution options (option 1 is sketched below):
Two Flume agents started and running to get the JMS data on the same physical node.
Two Flume agents started and running to get the JMS data on two physical nodes.
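As a rough sketch of option 1 (option 2 is the same configuration, just started on two machines), each agent gets its own config; the JMS provider class, URL and queue name below are placeholders that depend on your JMS server:

```
# Hypothetical agent "jms1"; a second agent "jms2" would look the same,
# differing only in its name (and, if desired, its HDFS target path).
jms1.sources  = jms-in
jms1.channels = mem
jms1.sinks    = hdfs-out

jms1.sources.jms-in.type = jms
jms1.sources.jms-in.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
jms1.sources.jms-in.connectionFactory = ConnectionFactory
jms1.sources.jms-in.providerURL = tcp://jmshost:61616
jms1.sources.jms-in.destinationType = QUEUE
jms1.sources.jms-in.destinationName = xml.events
jms1.sources.jms-in.channels = mem

jms1.channels.mem.type = memory
jms1.channels.mem.capacity = 20000

jms1.sinks.hdfs-out.type = hdfs
jms1.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/data/jms/agent1
jms1.sinks.hdfs-out.hdfs.fileType = DataStream
jms1.sinks.hdfs-out.channel = mem
```

Each agent is then started as its own process, e.g. flume-ng agent -n jms1 -f jms1.conf and flume-ng agent -n jms2 -f jms2.conf, either on one node (option 1) or on two nodes (option 2).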
