I am trying to replicate data from Cloudera HDFS to AWS cloud storage. This is not a one-time movement; I need to replicate data between the systems on an ongoing basis. I see there are two options available: Cloudera BDR and DistCp.
Could someone please explain the difference between the two? If I understand correctly, Cloudera BDR uses DistCp underneath to copy the data.
I know BDR provides:
- job scheduling, and
- Hive metadata extraction and movement to the cloud (not required in my case).
However, apart from that, what extra features does it provide that would make the case for choosing BDR?
I want to upgrade an OKD cluster from 4.5.0-0.okd-2020-10-03-012432 to 4.5.0-0.okd-2020-10-15-235428 in a restricted network.
I could not find any steps on the OKD documentation site. However, the steps are present on the OCP documentation site and look straightforward.
Queries:
Is this scenario supported in OKD?
In the document below, at step #7, what would be the corresponding step for OKD?
https://docs.openshift.com/container-platform/4.5/updating/updating-restricted-network-cluster.html#update-configuring-image-signature
Where can I get the image signature for OKD? Is this step valid for OKD?
I figured it out.
I did not need to perform the steps mentioned in https://docs.openshift.com/container-platform/4.5/updating/updating-restricted-network-cluster.html#update-configuring-image-signature
The "--apply-release-image-signature" flag of the "oc adm release mirror ..." command creates the ConfigMap automatically.
I want to build a Kafka connector in order to retrieve records from a database in near real time. My database is Oracle Database 11g Enterprise Edition Release 11.2.0.3.0, and the tables have millions of records. First, I would like to put the minimum load on my database by using CDC. Second, I would like to retrieve records based on a LastUpdate field whose value is after a certain date.
Searching the Confluent site, the only open-source connector I found was Kafka Connect JDBC. I think this connector doesn't have a CDC mechanism, and that it isn't feasible to retrieve millions of records when the connector starts for the first time. The alternative solution I considered is Debezium, but there is no Debezium Oracle connector on the Confluent site, and I believe it is still in beta.
Which solution would you suggest? Is anything wrong with my assumptions about Kafka Connect JDBC or the Debezium connector? Is there any other solution?
For query-based CDC, which is less efficient, you can use the JDBC source connector.
For log-based CDC, I am aware of a couple of options; note, however, that some of them require a license:
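As a concrete sketch of the query-based option: the JDBC source connector can run in timestamp mode against the LastUpdate column mentioned in the question. The connection details, table name, and topic prefix below are illustrative assumptions, not values from your environment:

```json
{
  "name": "oracle-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:oracle:thin:@//db-host:1521/ORCL",
    "connection.user": "kafka",
    "connection.password": "secret",
    "mode": "timestamp",
    "timestamp.column.name": "LASTUPDATE",
    "table.whitelist": "MY_TABLE",
    "topic.prefix": "oracle-",
    "poll.interval.ms": "5000"
  }
}
```

In timestamp mode the connector repeatedly queries for rows whose LASTUPDATE is newer than the last value it saw, which is exactly why it adds query load to the database and can miss deletes, unlike log-based CDC.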
1) Attunity Replicate, which lets users build real-time data pipelines from producer systems into Apache Kafka through a graphical interface, without any manual coding or scripting. I have been using Attunity Replicate for Oracle -> Kafka for a couple of years and was very satisfied.
2) Oracle GoldenGate, which requires a license.
3) Oracle LogMiner, which does not require any license and is used by both Attunity and kafka-connect-oracle, a Kafka source connector that captures all row-based DML changes from an Oracle database and streams them to Kafka. Its change-data-capture logic is based on Oracle's LogMiner solution.
We have numerous customers using IBM's IIDR (InfoSphere Data Replication) product to replicate data from Oracle databases (as well as Z mainframe, IBM i, SQL Server, etc.) into Kafka.
Regardless of which source is used, data can be normalized into one of many formats in Kafka. An example of an included, selectable format is:
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavrosinglerow.html
The solution is highly scalable and has been measured replicating changes at hundreds of thousands of rows per second.
We also have a proprietary ability to reconstitute data written in parallel to Kafka back into its original source order. So, despite data having been written to numerous partitions and topics, the original total order can be known. This functionality is known as the TCC (transactionally consistent consumer).
See the video and slides here...
https://kafka-summit.org/sessions/exactly-once-replication-database-kafka-cloud/
We are using Cloudera as our Hadoop environment.
Can someone please provide any guidance on how to integrate or migrate our existing Parquet/Impala setup to Kudu/Impala, to hopefully get a performance improvement in our existing pipeline?
Our existing pipeline, in brief:
We receive data in the format of csv/xlsx;
We move them onto HDFS;
We save them to another location in the format of parquet;
We create an external table in Impala with the location pointing to the partitioned Parquet data;
We do our ETL jobs with PySpark, Spark Scala, and Spark SQL;
We output our analytical result to csv.
The existing pipeline is working as expected; however, as the data keeps growing, the time and resources needed for the pipeline also increase.
We are wondering what the best practice is for migrating the Parquet-based Impala tables to Kudu-based Impala tables for better overall performance.
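For what it's worth, one common migration path is an Impala CTAS statement into a Kudu table. A minimal sketch, assuming a hypothetical source table events_parquet with a unique event_id column (Kudu requires an explicit primary key and partitioning; the names and partition count here are placeholders, and the right choices depend on your real schema and access patterns):

```sql
-- Hypothetical example: copy an existing Parquet-backed Impala table
-- into a new Kudu-backed table via CREATE TABLE AS SELECT.
CREATE TABLE events_kudu
PRIMARY KEY (event_id)
PARTITION BY HASH (event_id) PARTITIONS 16
STORED AS KUDU
AS SELECT * FROM events_parquet;
```

After validating the new table, downstream queries and the Spark ETL jobs would be repointed at events_kudu; the Parquet table can be kept as an archive or dropped.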
Thank you very much.
We are using the Nitrogen-SR3 release with a customized 2-node cluster. We have to support a "rolling upgrade" (no downtime for the cluster) of our entire application. Please let us know whether we can dynamically add/remove nodes to/from the Akka cluster and form a new cluster (after upgrading all nodes) programmatically. If yes, could you please give us the steps?
Thanks
I have about 100GB of data in BigQuery, and I'm fairly new to using data analysis tools. I want to grab about 3000 extracts for different queries, using a programmatic series of SQL queries, and then run some statistical analysis to compare kurtosis across those extracts.
Right now my workflow is as follows:
- running on my local machine, use the BigQuery Python client API to grab the data extracts and save them locally
- running on my local machine, run kurtosis analysis over the extracts using scipy
The second one of these works fine, but it's pretty slow and painful to save all 3000 data extracts locally (network timeouts, etc).
Is there a better way of doing this? Basically I'm wondering if there's some kind of cloud tool where I could quickly run the calls to get the 3000 extracts, then run the Python to do the kurtosis analysis.
I had a look at https://cloud.google.com/bigquery/third-party-tools but I'm not sure if any of those do what I need.
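For reference, the kurtosis step in my current workflow looks roughly like this; it's a minimal self-contained sketch (the statistic is written out by hand here rather than calling scipy, and the BigQuery fetch is only hinted at in comments, since the exact client code depends on the library version):

```python
# Excess (Fisher) kurtosis, the statistic scipy.stats.kurtosis computes
# by default: m4 / m2**2 - 3, where m_k is the k-th central moment.
def excess_kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3.0

# In the real workflow each extract comes from BigQuery, e.g. (hypothetical
# sketch, assuming the google-cloud-bigquery client library):
#   client = bigquery.Client()
#   values = [row[0] for row in client.query(sql).result()]
if __name__ == "__main__":
    sample = [-1.0, 1.0, -1.0, 1.0]
    print(excess_kurtosis(sample))  # a symmetric two-point distribution gives -2.0
```

The analysis itself is cheap; the bottleneck is fetching and storing the 3000 extracts, which is the part I'd like to move into the cloud.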
So far, Cloud Datalab is your best option:
https://cloud.google.com/datalab/
It is in beta, so some surprises are possible.
Datalab is built on top of the Jupyter/IPython option below and runs entirely in the cloud.
Another option is a Jupyter/IPython Notebook:
http://jupyter-notebook-beginner-guide.readthedocs.org/en/latest/
Our data science team started with this second option long ago, with great success, and is now moving toward Datalab.
For the rest of the business (prod, BI, ops, sales, marketing, etc.), though, we had to build our own workflow/orchestration tool, as nothing we found was good or relevant enough.
Two easy ways:
1: If your issue is the network, as you say, use a Google Compute Engine machine to do the analysis, in the same zone as your BigQuery tables (US, EU, etc.). It will not have network issues getting data from BigQuery and will be super fast.
The machine will only cost you for the minutes you use it. Save a snapshot of your machine to reuse the setup anytime (a snapshot also has a monthly cost, but much lower than keeping the machine up).
2: Use Google Cloud Datalab (beta as of Dec. 2015), which supports BigQuery sources and gives you all the tools you need to do the analysis and later share it with others:
https://cloud.google.com/datalab/
from their docs: "Cloud Datalab is built on Jupyter (formerly IPython), which boasts a thriving ecosystem of modules and a robust knowledge base. Cloud Datalab enables analysis of your data on Google BigQuery, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions)."
You can check out Cooladata.
It allows you to query BQ tables as external data sources.
You can either schedule your queries and export the results to Google Storage, where you can pick up from there, or use the built-in reporting tool to answer your 3000 queries.
It will also provide all the BI tools you will need for your business.