emr-dynamodb-connector for Apache Hadoop - amazon-dynamodb

I have an EMR cluster and intend to do CRUD operations on DynamoDB as part of my Reducer.
Note that I am not using Hive or Spark, just plain Apache Hadoop. Is there any documentation on how to connect to DynamoDB from my EMR cluster?

emr-dynamodb-connector is an open source library that includes Hadoop classes such as DynamoDBInputFormat and DefaultDynamoDBRecordReader for reading data from DynamoDB (with parallel scans and read rate control), and DynamoDBOutputFormat and DefaultDynamoDBRecordWriter for writing to DynamoDB (using the BatchWriteItem API) with write rate control to avoid throttling.
I don't think there is any AWS documentation on this beyond the README of that open source library.
All EMR clusters ship with a pre-built package of this library (except emr-dynamodb-tools), usually at /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar, and it is already on the classpath of EMR Hadoop. So you can use the Hadoop InputFormat and OutputFormat implementations from this JAR in your MapReduce application by setting the required configuration (including the DynamoDB settings) in the Job Configuration.
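As a rough illustration of that last point, the sketch below shows one way to pass the DynamoDB settings into the Job Configuration when submitting the MR application. It is only a hedged sketch: the property names are assumptions based on the connector's README (verify them against DynamoDBConstants in the JAR on your cluster), the JAR and driver class names are placeholders, and the driver is assumed to implement Tool so the -D generic options land in the Job Configuration.
# Hedged sketch: submit an MR job and pass DynamoDB settings as -D properties
# so they end up in the Job Configuration. Property names are assumptions taken
# from the emr-dynamodb-connector README; verify against DynamoDBConstants.
import subprocess

ddb_conf = {
    "dynamodb.output.tableName": "my-ddb-table",             # hypothetical table name
    "dynamodb.throughput.write.percent": "0.5",              # write rate control
    "dynamodb.endpoint": "dynamodb.us-east-1.amazonaws.com",
}

cmd = ["hadoop", "jar", "my-mr-app.jar", "com.example.MyJobDriver"]  # placeholders
for key, value in ddb_conf.items():
    cmd += ["-D", f"{key}={value}"]

subprocess.run(cmd, check=True)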

Related

Managing DBT profile file in MWAA

I would like to use dbt in an MWAA Airflow environment. To achieve this I need to install dbt in the managed environment and then run the dbt commands via Airflow operators or the CLI (BashOperator).
My problem with this solution is that I need to store the dbt profile file(s), which contain the target/source database credentials, in S3. Otherwise the file is not deployed to the Airflow worker nodes and so cannot be used by dbt.
Is there any other option? I feel this is a big security risk and it also undermines the use of Airflow (because I would like to use its built-in password manager).
My ideas:
- Create the profile file on the fly in the Airflow DAG as a task and write it out locally. I do not think this is a feasible workaround, because there is no guarantee that the dbt task will run on the same worker node on which my code created the file.
- Move the profile file manually to S3 (exclude it from CI/CD). Again, I see a security risk, as I am storing credentials on S3.
- Create a custom operator which builds the profile file on the same machine the command will run on. A maintenance nightmare.
- Use MWAA environment variables (https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-env-variables.html) and combine them with dbt's env_var function (https://docs.getdbt.com/reference/dbt-jinja-functions/env_var). Storing credentials in system-wide environment variables feels awkward, though.
Any good ideas or best practices?
#PeterRing, in our case we use dbt Cloud. Once the connection is set up in the Airflow UI, you call dbt Cloud job IDs to trigger the job (then use a sensor to monitor it until it completes).
If you can't use dbt Cloud, perhaps you can use AWS Secrets Manager to store your db profile/credentials: Configuring an Apache Airflow connection using a Secrets Manager secret
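To make the Secrets Manager route more concrete, here is a minimal sketch. It assumes MWAA is configured to use the Secrets Manager secrets backend, that a connection named dbt_warehouse exists as a secret (the connection id, environment variable names and profiles path are all hypothetical), and that profiles.yml reads its credentials with dbt's env_var function, so no credential file is ever stored in S3.
# Hedged sketch: resolve DB credentials from an Airflow connection (backed by
# AWS Secrets Manager on MWAA) and hand them to dbt as environment variables.
import os
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.hooks.base import BaseHook
from airflow.operators.python import PythonOperator

def run_dbt():
    conn = BaseHook.get_connection("dbt_warehouse")  # hypothetical connection id
    env = dict(
        os.environ,
        DBT_HOST=conn.host or "",
        DBT_USER=conn.login or "",
        DBT_PASSWORD=conn.password or "",
        DBT_SCHEMA=conn.schema or "",
    )
    # profiles.yml references these via "{{ env_var('DBT_PASSWORD') }}" etc.
    subprocess.run(
        ["dbt", "run", "--profiles-dir", "/usr/local/airflow/dags/dbt"],  # path is an assumption
        env=env,
        check=True,
    )

with DAG("dbt_run_example", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="dbt_run", python_callable=run_dbt)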

How to configure druid batch indexing jobs dynamic EMR cluster for batch ingestion?

I am trying to automate Druid batch ingestion using Airflow. My data pipeline creates an EMR cluster on demand and shuts it down once Druid indexing is completed. But for Druid we need to have the Hadoop configuration files in the Druid server folder (ref). This is blocking me from using dynamic EMR clusters. Can we override the Hadoop connection details in the job configuration, or is there a way for multiple indexing jobs to use different EMR clusters?
I have tried overriding the parameters (Hadoop configuration) from core-site.xml, yarn-site.xml, mapred-site.xml and hdfs-site.xml as job properties in the Druid indexing job. It worked. In that case there is no need to copy those files onto the Druid server.
I used the Python program below to convert the properties from the XML files into JSON key/value pairs. You can do the same for all the files and pass everything as the indexing job payload. This can be automated with Airflow after creating the different EMR clusters.
import os
import json

import xmltodict

path = 'mypath'
file = 'yarn-site.xml'

# Parse the Hadoop site file into a dict
with open(os.path.join(path, file)) as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

# Flatten the <property><name>/<value> pairs into a plain {name: value} dict
druid_dict = {prop.get('name'): prop.get('value')
              for prop in data_dict.get('configuration').get('property')}
print(json.dumps(druid_dict))
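For context on where the converted properties end up, the Hadoop indexing task payload can carry them under tuningConfig.jobProperties. A hedged sketch building on the code above (the file paths and the overlord endpoint are placeholders, and dataSchema/ioConfig are omitted):
# Hedged sketch: merge the converted Hadoop site files and embed them as
# jobProperties in a Druid "index_hadoop" task spec.
import json
import os

import xmltodict

def site_to_dict(path, filename):
    # Convert one Hadoop *-site.xml file into a {name: value} dict
    with open(os.path.join(path, filename)) as xml_file:
        parsed = xmltodict.parse(xml_file.read())
    return {p.get('name'): p.get('value')
            for p in parsed.get('configuration').get('property')}

path = 'mypath'
job_properties = {}
for site_file in ('core-site.xml', 'yarn-site.xml', 'mapred-site.xml', 'hdfs-site.xml'):
    job_properties.update(site_to_dict(path, site_file))

task_spec = {
    "type": "index_hadoop",
    "spec": {
        # "dataSchema" and "ioConfig" omitted for brevity
        "tuningConfig": {
            "type": "hadoop",
            "jobProperties": job_properties,  # overrides taken from the new EMR cluster
        },
    },
}
print(json.dumps(task_spec, indent=2))
# POST the spec to the overlord, e.g. http://<overlord-host>:8090/druid/indexer/v1/task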
While researching how this might be done, I found the hadoopDependencyCoordinates property here: https://druid.apache.org/docs/0.22.1/ingestion/hadoop.html#task-syntax, which seems relevant.

How to access on premise Teradata from Azure Databricks

We need to connect to on-premises Teradata from Azure Databricks.
Is that possible at all?
If yes, please let me know how.
I was looking for this information as well and I recently was able to access our Teradata instance from Databricks. Here is how I was able to do it.
Step 1. Check your cloud connectivity.
%sh nc -vz 'jdbcHostname' 'jdbcPort'
- 'jdbcHostname' is your Teradata server.
- 'jdbcPort' is your Teradata server's listening port. By default, Teradata listens on TCP port 1025.
Also check out Databricks' best practices on connecting to another infrastructure.
Step 2. Install Teradata JDBC driver.
Teradata Downloads page provides JDBC drivers by version and archive type. You can also check the Teradata JDBC Driver Supported Platforms page to make sure you pick the right version of the driver.
Databricks offers multiple ways to install a JDBC library JAR for databases whose drivers are not available in Databricks. Please refer to the Databricks Libraries to learn more and pick the one that is right for you.
Once installed, you should see it listed in the Cluster details page under the Libraries tab.
Terajdbc4.jar dbfs:/workspace/libs/terajdbc4.jar
Step 3. Connect to Teradata from Databricks.
You can define some variables to programmatically create these connections. Since my instance required LDAP, I added LOGMECH=LDAP in the URL. Without LOGMECH=LDAP it returns a "username or password invalid" error message.
(Replace the placeholder values with the values from your environment.)
driver = "com.teradata.jdbc.TeraDriver"
url = "jdbc:teradata://Teradata_database_server/Database=Teradata_database_name,LOGMECH=LDAP"
table = "Teradata_schema.Teradata_tablename_or_viewname"
user = "your_username"
password = "your_password"
Now that the connection variables are specified, you can create a DataFrame. You can also explicitly set this to a particular schema if you have one already. Please refer to Spark SQL Guide for more information.
Now, let’s create a DataFrame in Python.
My_remote_table = spark.read.format("jdbc")\
    .option("driver", driver)\
    .option("url", url)\
    .option("dbtable", table)\
    .option("user", user)\
    .option("password", password)\
    .load()
Now that the DataFrame is created, it can be queried. For instance, you can select particular columns and display them within Databricks.
display(My_remote_table.select("EXAMPLE_COLUMN"))
Step 4. Create a temporary view or a permanent table.
My_remote_table.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")
or
My_remote_table.write.format("parquet").saveAsTable("MY_PERMANENT_TABLE_NAME")
Steps 3 and 4 can also be combined if the intention is simply to create a table in Databricks from Teradata. Check out the Databricks documentation SQL Databases Using JDBC for other options.
Here is a link to the write-up I published on this topic.
Accessing Teradata from Databricks for Rapid Experimentation in Data Science and Analytics Projects
If you create a virtual network that can connect to on-prem, then you can deploy your Databricks instance into that VNet. See https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-inject.html.
I assume that there is a Spark connector for Teradata. I haven't used it myself, but I'm sure one exists.
You can't. If you run Azure Databricks, all the data needs to be stored in Azure. But you can pull the data from Teradata using a REST API and then save it in Azure.

Move and transform data between databases using Airflow

Using Airflow, I extract data from a MySQL database, transform it with Python and load it into a Redshift cluster.
Currently I use 3 Airflow tasks: they pass the data by writing CSV files to local disk.
How could I do this without writing to disk?
Should I write one big task in Python? (That would lower visibility.)
Edit: this is a question about Airflow, and about best practice for choosing the granularity of tasks and how to pass data between them.
It is not a general question about data migration or ETL. In this question ETL is only used as an example of a workload for Airflow tasks.
There are different ways you can achieve this:
If you are using the AWS RDS service for MySQL, you can use AWS Data Pipeline to transfer data from MySQL to Redshift. There is a built-in template in AWS Data Pipeline for that. You can even schedule incremental data transfers from MySQL to Redshift.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshift.html
How large is your table?
If your table is not too large, you can read the whole table into Python using a Pandas DataFrame or tuples and then transfer it to Redshift.
Even if you have a large table, you can still read it in chunks and push each chunk to Redshift.
Pandas is a little inefficient in terms of memory usage if you read a whole table into it.
Creating different tasks in Airflow will not help much. Either create a single function and call it in the DAG using a PythonOperator, or create a Python script and execute it using a BashOperator in the DAG.
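A rough sketch of that chunked approach inside a single callable (connection ids, table names and the chunk size are placeholders; for large volumes, the COPY-from-S3 approach described in a later answer is usually faster than row inserts):
# Hedged sketch: read the MySQL table in chunks with pandas and append each
# chunk to Redshift. Connection ids, table names and chunk size are placeholders.
import pandas as pd
from airflow.providers.mysql.hooks.mysql import MySqlHook
from airflow.providers.postgres.hooks.postgres import PostgresHook

def mysql_to_redshift():
    mysql_engine = MySqlHook(mysql_conn_id="mysql_default").get_sqlalchemy_engine()
    redshift_engine = PostgresHook(postgres_conn_id="redshift_default").get_sqlalchemy_engine()

    for chunk in pd.read_sql("SELECT * FROM source_table", mysql_engine, chunksize=50000):
        # transform the chunk with plain pandas here, then append it to Redshift
        chunk.to_sql("target_table", redshift_engine,
                     if_exists="append", index=False, method="multi")
This function would be the callable of a single PythonOperator (or run from a script via a BashOperator), as suggested above.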
One possibility is using the GenericTransfer operator from Airflow. See the docs.
This only works with smallish datasets, and Airflow's MySqlHook uses MySQLdb, which does not support Python 3.
Otherwise, I don't think there are other options, when using Airflow, than writing to disk.
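A minimal sketch of the GenericTransfer approach (connection ids, the query and table names are placeholders; the destination table must already exist, and since rows are fetched into memory and re-inserted it only suits small datasets):
# Hedged sketch: copy query results from MySQL into Redshift with GenericTransfer.
from datetime import datetime

from airflow import DAG
from airflow.operators.generic_transfer import GenericTransfer

with DAG("mysql_to_redshift_example", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    mysql_to_redshift = GenericTransfer(
        task_id="mysql_to_redshift",
        source_conn_id="mysql_default",
        destination_conn_id="redshift_default",
        sql="SELECT id, amount, created_at FROM source_table",
        destination_table="public.target_table",
        preoperator="TRUNCATE TABLE public.target_table",  # optional cleanup before loading
    )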
How large is your database?
Your approach of writing CSV to a local disk is optimal for a small database, so if this is the case you can write a Python task for that.
As the database gets larger there will be more COPY commands and more error-prone uploads, because you are dealing with billions of rows of data spread across multiple MySQL tables.
You will also have to figure out exactly in which CSV file something went wrong.
It is also important to determine whether you need high throughput, low latency or frequent schema changes.
In conclusion, you should consider a third-party option like Alooma to extract data from a MySQL database and load it into your Redshift cluster.
I have done a similar task before, but my system was on GCP.
What I did there was write the queried data out into AVRO files, which can be easily (and very efficiently) ingested into BigQuery.
So there is one task in the DAG to query out the data and write it to an AVRO file in Cloud Storage (the S3 equivalent), and one task after that to call the BigQuery operator to ingest the AVRO file.
You can probably do something similar with a CSV file in an S3 bucket, followed by a Redshift COPY command from that file. I believe COPY from a file in S3 is the fastest way to ingest data into Redshift.
These tasks are implemented as PythonOperators in Airflow.
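Building on that, here is a hedged sketch of the S3 + COPY variant; each callable would be wrapped in one of the PythonOperators mentioned above (the bucket, key, IAM role and connection ids are placeholders):
# Hedged sketch: task 1 dumps the MySQL query result as CSV into S3,
# task 2 issues a Redshift COPY from that file.
import csv
import io

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.mysql.hooks.mysql import MySqlHook
from airflow.providers.postgres.hooks.postgres import PostgresHook

S3_BUCKET = "my-etl-bucket"            # placeholder bucket
S3_KEY = "exports/source_table.csv"    # placeholder key

def extract_to_s3():
    rows = MySqlHook(mysql_conn_id="mysql_default").get_records("SELECT * FROM source_table")
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    S3Hook(aws_conn_id="aws_default").load_string(
        buffer.getvalue(), key=S3_KEY, bucket_name=S3_BUCKET, replace=True)

def copy_to_redshift():
    PostgresHook(postgres_conn_id="redshift_default").run(f"""
        COPY public.target_table
        FROM 's3://{S3_BUCKET}/{S3_KEY}'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV;
    """)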
You can pass information between tasks using XCom. You can read up on it in the documentation and there is also an example in the set of sample DAGs installed with Airflow by default.
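For completeness, a tiny XCom example (task ids are arbitrary). XCom is meant for small pieces of metadata, such as a row count or the S3 key of an exported file, not for the dataset itself:
# Hedged sketch: pass a small value between two PythonOperator tasks via XCom.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def produce():
    return "exports/source_table.csv"  # the return value is pushed to XCom automatically

def consume(ti):
    s3_key = ti.xcom_pull(task_ids="produce_key")
    print(f"Would COPY from {s3_key}")

with DAG("xcom_example", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    produce_task = PythonOperator(task_id="produce_key", python_callable=produce)
    consume_task = PythonOperator(task_id="consume_key", python_callable=consume)
    produce_task >> consume_task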

oozie to load file with passwords from HDFS

I have a file with DB connection properties (including passwords) in HDFS that needs to be accessible to all oozie jobs.
I am looking for a strategy to use this file from oozie actions in order to connect to the DB.
I wonder whether creating a Hive table to load that file and querying it via an oozie Hive action is a good strategy to consider.
