Which Aurora table stores DAG variable information? - airflow

I have an Airflow DAG which calls a particular bash command using a variable. At the backend, we have an Aurora DB. Are there any tables in the Aurora DB which store information about the variables used in Airflow DAGs? I need to create a report out of it, hence the need to access the variables from the backend.
I tried using the operational_insights schema but could not find any tables with the desired information.

If you are using Airflow Variables, you should be able to query a list of them with the REST API no matter which backend you use.
curl "http://<your Airflow host>/api/v1/variables" --user "login:password"
This is preferred over querying the Airflow metadata database directly because if you accidentally modify or drop a table you can corrupt your Airflow.
With that caveat: the standard table where Airflow Variables are stored is variable, so after logging into the db, SELECT * FROM variable; should return a list.
Again, this is for Airflow Variables. From your question I am not entirely sure if you mean those or, in general, any variables that tasks use. In the latter case you might be looking for the rendered_fields of the task instances, which can also be fetched using the API.
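For the report itself, something like the following sketch could pull the variables over the REST API (assuming basic auth is enabled; the host, credentials, and output handling are placeholders):
import requests

# Fetch all Airflow Variables via the stable REST API (basic auth assumed)
resp = requests.get(
    "http://<your Airflow host>/api/v1/variables",
    auth=("login", "password"),
)
resp.raise_for_status()

# The response body contains a "variables" list of {"key": ..., "value": ...} entries
for var in resp.json()["variables"]:
    print(var["key"], var["value"])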

Related

How to parameterize SQL in airflow based on its run schedule

In Airflow, we can perform SQL operations in databases like MySQL and PostgreSQL, or cloud databases like BigQuery.
We can also pass parameters into the SQL using user_defined_macros, which replaces them with certain values. E.g. parameterize the database/schema name to avoid 2 different versions of the SQL for the Dev/QA/Prod environments.
However, is there any way we can further optimize this for different schedules, provided the SQLs are the same?
Eg.
For regular runs use table: Dev_A/QA_A/Prod_A
For snapshot runs use table: Dev_B/QA_B/Prod_B
This will help us avoid 2 different versions of SQL for the regular and snapshot runs.
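For illustration, a small sketch of the user_defined_macros pattern described above, with a hypothetical run-type switch picking the table suffix; the environment/run-type values, the conn_id, and the operator choice (SQLExecuteQueryOperator from the common SQL provider, Airflow 2.4+ style) are all assumptions, not part of the question:
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

ENV = "Dev"           # Dev / QA / Prod, e.g. read from an Airflow Variable
RUN_TYPE = "regular"  # "regular" -> *_A tables, "snapshot" -> *_B tables

with DAG(
    dag_id="parameterized_sql_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    user_defined_macros={
        "target_table": f"{ENV}_{'A' if RUN_TYPE == 'regular' else 'B'}",
    },
) as dag:
    # The same SQL works for every environment and run type because the table
    # name is resolved by the macro at template-render time.
    load = SQLExecuteQueryOperator(
        task_id="load",
        conn_id="my_db",
        sql="INSERT INTO {{ target_table }} SELECT * FROM staging_table",
    )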

Is it possible to generate jobs in Dagster dynamically using configuration from database

Currently, my database has multiple departments. I need to apply a data pipeline to all of these departments with different configurations.
I want to load the configurations for each department from a database, then use these configurations to generate a list of Jobs in Dagster.
For example, I have 3 tenants:
Department1: Configuration1
Department2: Configuration2
Department3: Configuration3
This information is stored in my database.
How can I load this information and dynamically create 3 jobs (pipelines):
Pipeline1 for Department1 with Configuration1
Pipeline2 for Department2 with Configuration2
Pipeline3 for Department3 with Configuration3
Is it possible to do this in Dagster? I can do it with Airflow (dynamically generating DAGs) but am not sure how to do this in Dagster. I cannot load database configuration outside of an op/job in Dagster.
In Dagster, your @repository function is just a regular function, so you can run arbitrary code in there to query your database and generate jobs dynamically:
from dagster import repository

@repository
def my_repo():
    configs = ...  # some query to your database
    jobs = []
    for config in configs:
        jobs.append(get_job_for_my_config(config))
    return jobs
If you expect the database call to be somewhat expensive in terms of time, you can look into making your repository lazy-loaded; the Dagster RepositoryDefinition docs detail how to do that.
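For completeness, get_job_for_my_config is not a Dagster built-in; one possible shape for it, assuming each config row carries a department name (the op body and config keys are invented for illustration):
from dagster import job, op

def get_job_for_my_config(config):
    department = config["department"]

    @op(name=f"process_{department}")
    def process_department(context):
        # Department-specific work would go here, driven by `config`
        context.log.info(f"Running pipeline for {department}")

    @job(name=f"pipeline_{department}")
    def department_job():
        process_department()

    return department_job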

Airflow Metadata DB = airflow_db?

I have a project requirement to back up the Airflow Metadata DB to some data warehouse (but not using an Airflow DAG). At the same time, the requirement mentions a connection called airflow_db.
I am quite new to Airflow, so I googled a bit on the topic. I am a bit confused about this part. Our Airflow Metadata DB is PostgreSQL (this is built from docker-compose, so I am tinkering on a local install), but when I look at Connections in Airflow Web UI, it says airflow_db is MySQL.
I initially assumed that they are the same, but by the looks of it, they aren't? Can someone explain the difference and what they are for?
Airflow creates the airflow_db Conn Id with the MySQL connection type by default (see source code).
Default connections are not really useful in a production system. It's just a long list of stuff that you are probably not going to use.
Airflow 1.10.10 introduced the ability to skip creating the default list by setting:
load_default_connections = False in airflow.cfg (See PR)
To give more background: the connection list is where hooks find the information needed in order to connect to a service. It is not related to the backend database. That said, the backend is a database like any other, and if you wish to allow hooks to interact with it you can define it in the list like any other connection (which is probably why it shows up as an option in the defaults).
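As a small illustration of that last point: if you register your (PostgreSQL) metadata DB as a connection, a hook can read from it like any other database. The Conn Id below is made up, and the default airflow_db entry would only match your backend after you edit it:
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Query the metadata DB through a connection you defined yourself
hook = PostgresHook(postgres_conn_id="airflow_metadata_db")
dag_rows = hook.get_records("SELECT dag_id, is_paused FROM dag")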

How to create Hive connection in airflow to specific DB?

I am trying to create a Hive connection in Airflow that points to a specific database. I tried to find the params in HiveHook and tried the below in the extra options.
{"db": "test_base"}, {"schema": "test_base"} and {"database": "test_base"}
But it looks like nothing works and it always points to the default db.
Could someone help me point out what parameters we can pass in the extra options?
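For illustration, one possible approach (a sketch, not a confirmed fix): the Hive hooks generally take the database from the connection's Schema field rather than the extras JSON, and the HiveServer2Hook methods also accept a schema argument per call. The conn id, table name, and provider-style import path below are assumptions and depend on your Airflow version:
from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook

# The schema= argument here overrides whatever the connection's Schema field is set to
hook = HiveServer2Hook(hiveserver2_conn_id="my_hive_conn")
rows = hook.get_records("SELECT COUNT(*) FROM my_table", schema="test_base")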

Move and transform data between databases using Airflow

Using Airflow, I extract data from a MySQL database, transform it with Python and load it into a Redshift cluster.
Currently I use 3 Airflow tasks: they pass the data by writing CSV files to local disk.
How could I do this without writing to disk?
Should I write one big task in Python? (That would lower visibility.)
Edit: this is a question about Airflow, and best practice for choosing the granularity of tasks and how to pass data between them.
It is not a general question about data migration or ETL. In this question ETL is only used as an example of a workload for Airflow tasks.
There are different ways you can achieve this:
If you are using the AWS RDS service for MySQL, you can use AWS Data Pipeline to transfer data from MySQL to Redshift. They have an inbuilt template in AWS Data Pipeline to do that. You can even schedule an incremental data transfer from MySQL to Redshift.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshift.html
How large is your table?
If your table is not too large, you can read the whole table into Python using a Pandas DataFrame or tuples and then transfer it to Redshift.
Even if you have a large table, you can still read it in chunks and push each chunk to Redshift.
Pandas is a little inefficient in terms of memory usage if you read a whole table into it.
Creating different tasks in Airflow will not help much. You can either create a single function and call it in the DAG using a PythonOperator, or create a Python script and execute it using a BashOperator in the DAG.
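A rough sketch of the chunked Pandas approach described above; the connection strings, chunk size, and table names are placeholders, and to_sql over a direct connection is simple but slow for large volumes (S3 + COPY, as in the other answers, is faster):
import pandas as pd
from sqlalchemy import create_engine

mysql_engine = create_engine("mysql+pymysql://user:pass@mysql-host/source_db")
redshift_engine = create_engine("postgresql+psycopg2://user:pass@redshift-host:5439/warehouse")

# Read the MySQL table in chunks so the whole table never sits in memory at once
for chunk in pd.read_sql("SELECT * FROM source_table", mysql_engine, chunksize=50000):
    transformed = chunk  # apply your Python transformations here
    transformed.to_sql("target_table", redshift_engine, if_exists="append", index=False)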
One possibility is using the GenericTransfer operator from Airflow. See the docs.
This only works with smallish datasets, and Airflow's MySqlHook uses MySQLdb, which does not support Python 3.
Otherwise, I don't think there are other options, when using Airflow, than writing to disk.
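A minimal sketch of what that could look like; the conn ids, table names, and the exact import path (it has moved between Airflow versions; Airflow 2.4+ style shown) are assumptions:
from datetime import datetime

from airflow import DAG
from airflow.operators.generic_transfer import GenericTransfer

with DAG("mysql_to_redshift_generic", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # GenericTransfer reads the rows into memory and inserts them into the target,
    # so no intermediate file is written to disk (hence the smallish-dataset caveat)
    transfer = GenericTransfer(
        task_id="transfer",
        source_conn_id="mysql_default",
        destination_conn_id="redshift_default",
        sql="SELECT * FROM source_table",
        destination_table="target_table",
        preoperator="TRUNCATE TABLE target_table",
    )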
How large is your database?
Your approach of writing CSV to local disk is optimal with a small database, so if this is the case you can write a Python task for that.
As the database gets larger there will be more COPY commands and error-prone uploading, because you're dealing with billions of rows of data spread across multiple MySQL tables.
You will also have to figure out exactly in which CSV file something went wrong.
It is also important to determine whether you need high throughput, high latency or frequent schema changes.
In conclusion, you should consider a third-party option like Alooma to extract data from a MySQL database and load it into your Redshift cluster.
I have done a similar task before, but my system was in GCP.
What I did there was write the queried data into AVRO files, which can easily (and very efficiently) be ingested into BigQuery.
So there is one task in the DAG to query the data and write it to an AVRO file in Cloud Storage (the S3 equivalent), and one task after that to call the BigQuery operator to ingest the AVRO file.
You can probably do something similar with a CSV file in an S3 bucket, and then a Redshift COPY command from the CSV file in S3. I believe Redshift COPY from a file in S3 is the fastest way to ingest data into Redshift.
These tasks are implemented as PythonOperators in Airflow.
You can pass information between tasks using XCom. You can read up on it in the documentation and there is also an example in the set of sample DAGs installed with Airflow by default.
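Putting those last two ideas together, a rough sketch of the CSV-in-S3 plus COPY pattern with the S3 key handed between tasks via XCom (TaskFlow style, Airflow 2.4+ assumed; the bucket, conn ids, IAM role, and table names are placeholders):
from datetime import datetime
import csv
import io

from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.mysql.hooks.mysql import MySqlHook
from airflow.providers.postgres.hooks.postgres import PostgresHook

@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def mysql_to_redshift_via_s3():
    @task
    def extract_to_s3() -> str:
        rows = MySqlHook(mysql_conn_id="mysql_default").get_records("SELECT * FROM source_table")
        buffer = io.StringIO()
        csv.writer(buffer).writerows(rows)
        key = "exports/source_table.csv"
        S3Hook(aws_conn_id="aws_default").load_string(
            buffer.getvalue(), key=key, bucket_name="my-bucket", replace=True
        )
        return key  # the returned key travels to the next task as an XCom

    @task
    def copy_into_redshift(key: str) -> None:
        PostgresHook(postgres_conn_id="redshift_default").run(
            f"""
            COPY target_table
            FROM 's3://my-bucket/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            CSV;
            """  # placeholder IAM role ARN
        )

    copy_into_redshift(extract_to_s3())

mysql_to_redshift_via_s3()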
