How to parameterize SQL in airflow based on its run schedule - airflow

In airflow, we can perform SQL operations in databases like MySQL, PostgreSQL or cloud database like BigQuery.
We can also pass the parameter using user_defined_macros in SQL which will replace it with certain values. Eg. parameterize the database/schema name to avoid 2 different versions of sql w.r.t Dev/QA/Prod environment.
However, is there any way using which we can further optimize it for different schedule provided SQL's are same.
Eg.
For regular run use table : Dev_A/QA_A/Prod_A
For snapshot run use table : Dev_B/QA_B/Prod_B
This will help us to avoid 2 different versions of SQL for regular and snapshot run.

Related

Which Aurora table store DAG variable information?

I have an airflow DAG which call a particular bash command using a variable. At the backend, we have Aurora DB. Do we know if there are any tables in the Aurora DB which stores information of the variables used in Airflow DAGs? I need to create a report out of it and hence, the ask to access the variables from backend.
I tried using the operational_insights schema but could not find any tables with the desired information.
If you are using an Airflow variable you should be able to query a list of them with the REST API no matter which backend you use.
curl "http://<your Airflow host>/api/v1/variables" --user "login:password"
This is preferred over querying the Airflow metadata database directly because if you accidentally modify or drop a table you can corrupt your Airflow.
With that caveat: the standard table where Airflow variables are stored is variable so after logging into the db SELECT * FROM variable; should return a list.
Again this is for Airflow Variables. From your question I am not entirely sure if you mean that or in general any variables that tasks use. In the latter case you might be looking for the rendered_fields parameter of the task instances, which can also be done using the API.

Efficient method to move data from Oracle (SQL Developer) to MS SQL Server

Daily, I query a few tables in SQL Developer, filtering to prior day activity, adding column to date stamp the data, then export to xlsx. Then I manually import each file to a MS SQL Server via SQL Server Import and Export Wizard. Takes many clicks, much waiting...
I'm essentially creating an archive in SQL Server, the application I'm querying overwrites data daily. I'm not a DBA of either database, I use the archived data to do validations and research.
It's tough to get my org to provide additional software, I've been trying to make this work via SQL Developer, SSMS Express ed, and other standard tools.
I'm looking to make this reasonably automated, either via scripts, scheduled tasks, etc. Appreciate suggestions that would work on my current situation, but if that isn't reasonable, and there's a very reasonable alternative, I can go back to the org to request software/access/assistance.
You can use SSIS to import the data directly from Oracle to SQL Server, unless you need the .xlsx files for another purpose. You can also export from Oracle to these, then load to SQL Server from these files if you do need the files. For the date stamp column, a Derived Column can be added within a Data Flow Task using the SSIS GETDATE() function for a timestamp in order to achieve the same result. This function returns a timestamp, and if only the date is necessary the (DT_DBDATE) function can cast it to a date data type that's compatible with this data type of SQL Server. Once you have the SSIS package configured, you can schedule in to run at regular intervals as a SQL Agent job. I'd also recommend installing the SSIS catalog (SSISDB) and using this the source to run the packages from. The following links shed more light on these areas.
SSIS
Connecting to Oracle from SSIS
Data Flow Task
Derived Column Transformation
Creating SQL Server Agent Jobs for SSIS packages
SSIS Catalog
Another option that you may consider (if it is supported in SQL Express) is using the BCP utility, which can be run from command line.
The BCP utility allows you to bulk copy the data from a delimited text file into a SQL Server table.
If you go this approach, things to consider:
Number of Columns in the source file need to match the number of columns in the destination
Data types must match (or be comparable)
Typically, empty strings will be converted to nulls, so you will need to consider if the columns are nullable.
(to name a few - if you want to delve deeper, you might also need to look at custom delimiters between fields and records. Don't forget, commas and line feeds are still valid characters in char type fields).
Anyhow, maybe it will work for you, maybe not. Sure, you might still have to deal with the exporting of the data from Oracle, but it might ease the pain getting the data in.
Have a read:
https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017

Move and transform data between databases using Airflow

Using airflow, I extract data from a MySQL database, transform it with python and load it into a Redshift cluster.
Currently I use 3 airflow tasks : they pass the data by writing CSV on local disk.
How could I do this without writing to disk ?
Should I write one big task in python ? ( That would lower visibility )
Edit: this is a question about Airflow, and best practice for choosing the granularity of tasks and how to pass data between them.
It is not a general question about data migration or ETL. In this question ETL is only used as an exemple of workload for airflow tasks.
There are different ways you can achieve this:
If you are using AWS RDS service for MySQL, you can use AWS Data Pipeline to transfer data from MySQL to Redshift. They have inbuilt template in AWS Data Pipeline to do that. You can even schedule the incremental data transfer from MySQL to Redshift
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshift.html
How large is your table?
If your table is not too large and you can read the whole table into python using Pandas DataFrame or tuples and then transfer it Redshift.
Even if you have large table still you can read that table in chunks and push each chunk to Redshift.
Pandas are little inefficient in terms of memory usage if you read table into it.
Creating different tasks in Airflow will not help much. Either you can create a single function and call that function in dag using PythonOperator or create a python script and execute it using BashOperator in dag
One possibility is using the GenericTransfer operator from airflow. See docs
This only works with smallish datasets and the mysqlhook of airflow uses MySQLdb which does not support python 3.
Otherwise, I dont think there are other options, when using airflow, than writing to disk.
How large is your database?
Your approach of writing CSV on a local disk is optimal with a small database, so if this is the case you can write a Python task for that.
As the database get larger there will be more COPY commands and error prone uploading because you’re dealing with billions of rows of data spread across multiple MySQL tables.
You will also have to figure out exactly in which CSV file something went wrong.
It is also important to determine whether you need high throughput, high latency or frequent schema changes.
In conclusion, you should consider a third-party option like Alooma to extract data from a MySQL database and load it into your Redshift cluster.
I have done similar task before, but my system was in GCP.
What I did there was to write the data queried out into AVRO files, which can be easily (and very efficiently) be ingested into BigQuery.
So there is one task in the dag to query out the data and write to an AVRO file in Cloud Storage (S3 equivalent). And one task after that to call BigQuery operator to ingest the AVRO file.
You can probably do similar with csv file in S3 bucket, and then RedShift COPY command from the csv file in S3. I believe RedShift COPY from file in S3 is the fastest way to ingest data into RedShift.
These tasks are implemented as PythonOperators in Airflow.
You can pass information between tasks using XCom. You can read up on it in the documentation and there is also an example in the set of sample DAGs installed with Airflow by default.

Spark newbie (ODBC/SparkSQL)

I have a spark cluster setup and tried both native scala and spark sql on my dataset and the setup seems to work for the most part. I have the following questions
From an ODBC/extenal connectivity to the cluster, what should i expect?
- the admin/developer shapes the data and persists/caches a few RDDs that will be exposed? (Thinking on the lines of hive tables)
- What would be the equivalent of connecting to a "Hive metastore" in spark/spark sql?
Is thinking along the lines of hive faulted?
My other question was
- when i issue hive queries, (and say create tables and such), it uses the same hive metastore as hadoop/hive
- Where do the tables get created when i issue sql queries using sqlcontext?
- If i persist the table, it is the same concept as persisting an RDD?
Appreciate your answers
Nithya
(this is written with spark 1.1 in mind, be aware that new features tend to be added quickly, some limitations mentioned below might very well disappear at some point in the future).
You can use Spark SQL with Hive syntax and connect to Hive metastore, which will result in your Spark SQL hive commands to be executed on the same data space as if they were executed through Hive directly.
To do that you simply need to instantiate a HiveContext as explained here and provide a hive-site.xml configuration file that specifies, among other things, where to find the Hive metastore.
The result of a SELECT statement is a SchemaRDD, which is an RDD of Row objects that has an associated schema. You can use it just like you use any RDD, including cache and persist and the effect is the same (the fact that the data comes from hive has not influence here).
If your hive command is creating data, e.g. "CREATE TABLE ... ", the corresponding table gets created in exactly the same place as with regular Hive, i.e. /var/lib/hive/warehouse by default.
Executing Hive SQL through Spark provides you with all the caching benefits of Spark: executing a 2nd SQL query on the same data set within the same spark context will typically be much faster than the first query.
Since Spark 1.1, it is possible to start the Thrift JDBC server, which is essentially an equivalent to HiveServer2 and thus allows you to execute SparkQL commands through a JDBC connection.
Note that not all Hive features are available (yet?), see details here.
Finally, you can also discard Hive syntax and metastore and execute SQL queries directly on CSV and Parquet files. My best guess is that this will become the preferred approach in the future, although at the moment the set of SQL features available like this is smaller than when using the Hive syntax.

SqLite Multicore Processing

How do you configure SqLite 3 to process a single query using more than 1 core of a CPU ?
Since version 3.8.7, SQLite can use multiple threads for parallel sorting of large data sets.
sqlite3 itself does not do that.
However, I have a project called multicoresql on github that has utility programs and a C library for spreading sql queries onto multiple cores.
It uses sharding so you have to break your large database or datafile into multiple sqlite3 database files. A single SQL query must be written as two SQL queries, a map query that first runs on all the shards, and a reduce query to determine the result from the collected output from all the shards running the map query.

Resources