Move and transform data between databases using Airflow

Using Airflow, I extract data from a MySQL database, transform it with Python, and load it into a Redshift cluster.
Currently I use 3 Airflow tasks: they pass the data by writing CSV files to local disk.
How could I do this without writing to disk?
Should I write one big task in Python? (That would lower visibility.)
Edit: this is a question about Airflow and best practice for choosing the granularity of tasks and how to pass data between them.
It is not a general question about data migration or ETL. In this question, ETL is only used as an example of a workload for Airflow tasks.

There are different ways you can achieve this:
If you are using the AWS RDS service for MySQL, you can use AWS Data Pipeline to transfer data from MySQL to Redshift. It has a built-in template for exactly that, and you can even schedule incremental data transfers from MySQL to Redshift:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshift.html
How large is your table?
If your table is not too large, you can read the whole table into Python using a Pandas DataFrame (or tuples) and then transfer it to Redshift.
Even if the table is large, you can still read it in chunks and push each chunk to Redshift.
Note that Pandas is a little inefficient in terms of memory usage when you read a table into it.
Creating different Airflow tasks will not help much here. You can either create a single function and call it from the DAG using a PythonOperator, or write a Python script and execute it with a BashOperator in the DAG.
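For illustration, here is a minimal sketch of the chunked, in-memory approach described above, assuming SQLAlchemy engines for both databases; the connection strings, table names, and transform are placeholders, not part of the original answer.
```
# Stream a MySQL table to Redshift in chunks, without writing to disk.
import pandas as pd
from sqlalchemy import create_engine

mysql_engine = create_engine("mysql+pymysql://user:pass@mysql-host/source_db")
redshift_engine = create_engine("postgresql+psycopg2://user:pass@redshift-host:5439/warehouse")

def transform(chunk: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for whatever transformation the task performs.
    return chunk

# Read the source table in chunks so the whole table never sits in memory at once.
for chunk in pd.read_sql("SELECT * FROM source_table", mysql_engine, chunksize=10_000):
    transform(chunk).to_sql("target_table", redshift_engine,
                            if_exists="append", index=False, method="multi")
```
Plain row inserts into Redshift are slow compared with COPY from S3, so this pattern is best reserved for modest data volumes.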

One possibility is using the GenericTransfer operator from Airflow (see the docs).
This only works with smallish datasets, and Airflow's MySQL hook uses MySQLdb, which does not support Python 3.
Otherwise, I don't think there are other options, when using Airflow, than writing to disk.
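For reference, a hedged sketch of what the GenericTransfer approach might look like; the connection IDs and table names are assumptions, and the exact import path and DAG arguments vary between Airflow versions.
```
# GenericTransfer pulls rows via the source hook and inserts them via the
# destination hook, so nothing is written to local disk.
from datetime import datetime

from airflow import DAG
from airflow.operators.generic_transfer import GenericTransfer

with DAG("mysql_to_redshift_transfer",
         start_date=datetime(2024, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:
    transfer = GenericTransfer(
        task_id="transfer_orders",
        sql="SELECT * FROM orders",          # read through the source connection
        destination_table="public.orders",   # inserted through the destination connection
        source_conn_id="mysql_source",
        destination_conn_id="redshift_default",
    )
```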

How large is your database?
Your approach of writing CSV files to local disk is fine for a small database, so if that is your case you can keep a Python task for it.
As the database grows, there will be more COPY commands and the uploads become more error-prone, because you are dealing with billions of rows of data spread across multiple MySQL tables.
You will also have to figure out exactly which CSV file something went wrong in.
It is also important to determine whether you need high throughput, low latency, or frequent schema changes.
In conclusion, you could consider a third-party option like Alooma to extract data from the MySQL database and load it into your Redshift cluster.

I have done a similar task before, but my system was on GCP.
What I did there was write the queried data out to AVRO files, which can be easily (and very efficiently) ingested into BigQuery.
So there is one task in the DAG to query the data and write it to an AVRO file in Cloud Storage (the S3 equivalent), and one task after that to call the BigQuery operator to ingest the AVRO file.
You can probably do something similar with a CSV file in an S3 bucket, followed by a Redshift COPY command from that file in S3. I believe Redshift COPY from a file in S3 is the fastest way to ingest data into Redshift.
These tasks are implemented as PythonOperators in Airflow.
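As a rough sketch of that S3-plus-COPY pattern adapted to AWS (bucket name, connection details, and the IAM role below are placeholders), the two callables could look like this, each wrapped in its own PythonOperator:
```
# Task 1: extract from MySQL and stage a CSV in S3.
# Task 2: issue a Redshift COPY from that staged file.
from io import StringIO

import boto3
import pandas as pd
import psycopg2

def extract_to_s3(**context):
    df = pd.read_sql("SELECT * FROM source_table",
                     "mysql+pymysql://user:pass@mysql-host/source_db")
    buf = StringIO()
    df.to_csv(buf, index=False, header=False)
    boto3.client("s3").put_object(Bucket="my-staging-bucket",
                                  Key="exports/source_table.csv",
                                  Body=buf.getvalue())

def copy_into_redshift(**context):
    conn = psycopg2.connect(
        "dbname=warehouse host=redshift-host port=5439 user=user password=pass")
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY public.target_table
            FROM 's3://my-staging-bucket/exports/source_table.csv'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
            CSV;
        """)
```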

You can pass information between tasks using XCom. You can read up on it in the documentation and there is also an example in the set of sample DAGs installed with Airflow by default.
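For example, a minimal XCom sketch (task IDs and the payload are illustrative): the value returned from one PythonOperator callable is pushed to XCom automatically and pulled by the next task. Keep in mind XCom is meant for small metadata, not bulk data.
```
# The return value of produce() is stored as an XCom for its task instance.
def produce(**context):
    return {"rows_extracted": 42, "s3_key": "exports/source_table.csv"}

# A downstream task pulls it by the upstream task_id.
def consume(**context):
    meta = context["ti"].xcom_pull(task_ids="produce_task")
    print(f"Upstream wrote {meta['rows_extracted']} rows to {meta['s3_key']}")
```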

Related

Airflow best practices - How to correctly approach a workflow

I'm implementing a simple workflow in which I have three different data sources (API, parquet file and PostgreSQL database). The goal is to gather the data from all the different sources and store it in a PostgreSQL warehouse.
The task flow I projected goes like:
Create PostgreSQL DW >> [Get data from source 1, Get data from source 2, Get data from source 3] >> Insert data into PostgreSQL DW
In order for this to work, I would have to share the data from the "Get Data" tasks to the "Insert Data" task.
My questions are:
Is sharing data between tasks a bad/wrong thing to do?
Should I approach this any other way?
If I implement a single task that both gets the data from the source and inserts it into the other database, wouldn't that make it non-idempotent?
Airflow is primarily an orchestrator. Sharing small snippets of data between tasks is encouraged with XComs, but large amounts of data are not supported. XComs are stored in the Airflow database which would quickly fill up if you used this pattern often.
Your use case sounds more suited to Apache Beam, which is designed to process data in parallel at scale. It's much more common to use Airflow to schedule your Beam pipelines, which do the actual work of the ETL.
There is an Airflow Operator for Apache Beam. Depending on the size of your data you can process it locally on the Airflow workers with the DirectRunner. Or if you need to process large amounts of data you can offload the pipeline execution to a cloud solution like GCP's Dataflow using Beam's DataflowRunner.
The Airflow + Beam pattern is a much more common and powerful combination when dealing with data. Even if your datasets are small, this pattern will let you scale with little extra effort if you need to in the future.
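To make that concrete, here is a hedged sketch of a tiny Beam pipeline of the kind Airflow would orchestrate; the paths and transforms are placeholders. Swapping DirectRunner for DataflowRunner (plus GCP options) would offload execution to Dataflow.
```
# A minimal Beam pipeline: read, parse, filter, and write back out.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(runner="DirectRunner")
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/source1.csv")
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "Keep valid rows" >> beam.Filter(lambda fields: len(fields) == 3)
            | "Format" >> beam.Map(",".join)
            | "Write" >> beam.io.WriteToText("gs://my-bucket/clean/source1")
        )

if __name__ == "__main__":
    run()
```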

How to configure druid batch indexing jobs dynamic EMR cluster for batch ingestion?

I am trying to automate Druid batch ingestion using Airflow. My data pipeline creates an EMR cluster on demand and shuts it down once Druid indexing is completed. But for Druid we need to have the Hadoop configuration files in the Druid server folder (ref). This is blocking me from using dynamic EMR clusters. Can we override the Hadoop connection details in the job configuration, or is there a way to let multiple indexing jobs use different EMR clusters?
I have tried overriding the Hadoop configuration parameters from core-site.xml, yarn-site.xml, mapred-site.xml, and hdfs-site.xml as job properties in the Druid indexing job. It worked. In that case there is no need to copy those files onto the Druid server.
I used the Python program below to convert the properties from an XML file into JSON key-value pairs. You can do the same for all the files and pass everything in the indexing job payload. This can be automated with Airflow after creating the different EMR clusters.
```
import os
import json
import xmltodict

path = 'mypath'
file = 'yarn-site.xml'

# Parse the Hadoop config XML into a dict.
with open(os.path.join(path, file)) as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

# Flatten the <property> name/value pairs into a plain dict,
# ready to be passed as job properties in the indexing job payload.
druid_dict = {prop.get('name'): prop.get('value')
              for prop in data_dict.get('configuration').get('property')}
print(json.dumps(druid_dict))
```
In researching how this might be done, I found the hadoopDependencyCoordinates property described at https://druid.apache.org/docs/0.22.1/ingestion/hadoop.html#task-syntax, which seems relevant.
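Building on the answer above, a hedged sketch of how the flattened properties could be injected into a Hadoop-based indexing spec and submitted to the overlord; the template file name, overlord URL, and port are assumptions.
```
# Load a prepared index_hadoop spec template, inject the per-cluster Hadoop
# properties built above as tuningConfig.jobProperties, and submit the task.
import json
import requests

with open("index_hadoop_template.json") as f:
    task_payload = json.load(f)

task_payload["spec"]["tuningConfig"]["jobProperties"] = druid_dict
requests.post("http://druid-overlord:8090/druid/indexer/v1/task", json=task_payload)
```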

Is it possible to combine two different database utility like Teradata's Fast export to Greenplum's GPloader?

Usually, I use a JDBC connection with some ETL tool to move data from one database (e.g. Teradata) to another (e.g. Greenplum).
However, both of these databases come with built-in utilities that can load/export huge amounts of data very fast, far faster than JDBC! But the downside, as far as I am aware, is that they can only do so to/from a file.
So, if I want to use them, I have to follow a process like:
Teradata ---(FastExport)---> File ---(GPloader)---> Greenplum
I am wondering if it is possible to skip the file part and combine the two utilities:
Teradata ---(FastExport & GPloader)---> Greenplum
That way I could transfer huge amounts of data very quickly!
Yes, you most certainly can. Greenplum supports all kinds of external tables. One solution is to use an External Table that executes a command. That command can be a Java program that connects to Teradata to get data and uses the FastExport option.
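As a hedged illustration of that external-table idea (not gplink itself): the DDL below, issued here via psycopg2, defines an external web table whose command is assumed to be a wrapper script that streams Teradata FastExport output as pipe-delimited text to stdout. Table, columns, script path, and connection details are all placeholders.
```
import psycopg2

# The external web table runs a command on the Greenplum host; selecting from
# the table streams the command's stdout straight into the database.
ddl = """
CREATE EXTERNAL WEB TABLE ext_td_orders (
    order_id   bigint,
    customer   text,
    amount     numeric
)
EXECUTE '/opt/scripts/fastexport_orders.sh' ON MASTER
FORMAT 'TEXT' (DELIMITER '|');
"""

with psycopg2.connect("dbname=analytics host=gp-master user=gpadmin") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)
        # Loading is then a plain INSERT ... SELECT from the external table.
        cur.execute("INSERT INTO orders SELECT * FROM ext_td_orders;")
```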
I wrote the tool "gplink" to do just this. It automates the creation of Greenplum External Tables for JDBC sources.
Github:
https://github.com/pivotalguru/gplink
Teradata connection example:
https://github.com/pivotalguru/gplink/blob/master/connections/teradata.properties
And my blog:
http://www.pivotalguru.com/?page_id=982

cosmosdb - archive data older than n years into cold storage

I have researched in several places and could not find any direction on the options for archiving old data from Cosmos DB into cold storage. I see that for DynamoDB in AWS it is mentioned that you can move DynamoDB data into S3, but I am not sure what the options are for Cosmos DB. I understand there is a time-to-live option where data is deleted after a certain date, but I am interested in archiving rather than deleting. Any direction would be greatly appreciated. Thanks.
I don't think there is a single-click built-in feature in CosmosDB to achieve that.
Still, since you mentioned you would appreciate any direction, I suggest you consider the DocumentDB Data Migration Tool.
Notes about the Data Migration Tool:
- You can specify a query to extract only the cold data (for example, by a creation date stored within the documents).
- It supports exporting to various targets (JSON file, blob storage, another database, another Cosmos DB collection, etc.).
- It compacts the data in the process: it can merge documents into a single array document and zip it.
- Once you have the configuration set up, you can script the export to be triggered automatically using your favorite scheduling tool.
- You can easily reverse source and target to restore the cold data to the active store (or to dev, test, backup, etc.).
To remove the exported data you could use the mentioned TTL feature, but that could cause data loss should your export step fail. I would instead suggest writing and executing a stored procedure to query and delete all exported documents in a single call. That stored procedure would not execute automatically, but it could be included in the automation script and executed only if the data was exported successfully first.
See: Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs.
UPDATE:
These days Cosmos DB has added the Change feed, which really simplifies writing a carbon copy somewhere else.

Spark newbie (ODBC/SparkSQL)

I have a Spark cluster set up and have tried both native Scala and Spark SQL on my dataset, and the setup seems to work for the most part. I have the following questions.
From ODBC/external connectivity to the cluster, what should I expect?
- Does the admin/developer shape the data and persist/cache a few RDDs that will be exposed? (Thinking along the lines of Hive tables.)
- What would be the equivalent of connecting to a "Hive metastore" in Spark/Spark SQL?
Is thinking along the lines of Hive flawed?
My other questions are:
- When I issue Hive queries (and, say, create tables and such), does it use the same Hive metastore as Hadoop/Hive?
- Where do the tables get created when I issue SQL queries using sqlContext?
- If I persist a table, is it the same concept as persisting an RDD?
Appreciate your answers
Nithya
(This is written with Spark 1.1 in mind; be aware that new features tend to be added quickly, so some of the limitations mentioned below may well disappear in future versions.)
You can use Spark SQL with Hive syntax and connect to the Hive metastore, which will result in your Spark SQL Hive commands being executed against the same data as if they were executed through Hive directly.
To do that you simply need to instantiate a HiveContext as explained here and provide a hive-site.xml configuration file that specifies, among other things, where to find the Hive metastore.
The result of a SELECT statement is a SchemaRDD, which is an RDD of Row objects with an associated schema. You can use it just like any other RDD, including cache and persist, and the effect is the same (the fact that the data comes from Hive has no influence here).
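For concreteness, a small sketch of that workflow in the PySpark API of the Spark 1.x era; the table and column names are placeholders.
```
# Query a Hive table through a HiveContext and cache the result like any RDD.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-from-spark")
hc = HiveContext(sc)  # reads hive-site.xml on the classpath to locate the metastore

# The result is a SchemaRDD: an RDD of Row objects with an attached schema.
events = hc.sql("SELECT user_id, event_type FROM events WHERE dt = '2015-01-01'")
events.cache()  # cached like any other RDD; later queries reuse the in-memory data

print(events.count())
print(events.take(5))
```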
If your hive command is creating data, e.g. "CREATE TABLE ... ", the corresponding table gets created in exactly the same place as with regular Hive, i.e. /var/lib/hive/warehouse by default.
Executing Hive SQL through Spark provides you with all the caching benefits of Spark: executing a 2nd SQL query on the same data set within the same spark context will typically be much faster than the first query.
Since Spark 1.1, it is possible to start the Thrift JDBC server, which is essentially an equivalent of HiveServer2 and thus allows you to execute Spark SQL commands through a JDBC connection.
Note that not all Hive features are available (yet?), see details here.
Finally, you can also discard Hive syntax and metastore and execute SQL queries directly on CSV and Parquet files. My best guess is that this will become the preferred approach in the future, although at the moment the set of SQL features available like this is smaller than when using the Hive syntax.
