How to store SQL output to a pandas DataFrame using Airflow? - airflow

I want to store data from SQL into a pandas DataFrame, do some data transformations, and then load the result into another table using Airflow.
The issue I am facing is that the connection strings to the tables are accessible only through Airflow, so I need to use Airflow as the medium to read and write data.
How can this be done?
My code:
Task1 = PostgresOperator(
    task_id='Task1',
    postgres_conn_id='REDSHIFT_CONN',
    sql="SELECT * FROM Western.trip limit 5 ",
    params={'limit': '50'},
    dag=dag
)
The output of the task needs to be stored in a DataFrame (df), and after transformations it should be loaded back into another table.
How can this be done?

I doubt there's a built-in operator for this. You can easily write a custom operator:
Extend PostgresOperator, or just BaseOperator / any other operator of your choice. All custom code goes into the overridden execute() method.
Then use PostgresHook to obtain a pandas DataFrame by invoking its get_pandas_df() function.
Perform whatever transformations you have to do on your pandas df.
Finally use the insert_rows() function to insert the data into the destination table.
UPDATE-1
As requested, I'm hereby adding the code for the operator.
from typing import Dict, Any, List, Tuple

from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.utils.decorators import apply_defaults
from pandas import DataFrame


class MyCustomOperator(PostgresOperator):

    @apply_defaults
    def __init__(self, destination_table: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.destination_table: str = destination_table

    def execute(self, context: Dict[str, Any]):
        # create PostgresHook
        self.hook: PostgresHook = PostgresHook(postgres_conn_id=self.postgres_conn_id,
                                               schema=self.database)
        # read data from Postgres-SQL query into pandas DataFrame
        df: DataFrame = self.hook.get_pandas_df(sql=self.sql, parameters=self.parameters)
        # perform transformations on df here
        df['column_to_be_doubled'] = df['column_to_be_doubled'].multiply(2)
        # ...
        # convert pandas DataFrame into list of tuples
        rows: List[Tuple[Any, ...]] = list(df.itertuples(index=False, name=None))
        # insert list of tuples in destination Postgres table
        self.hook.insert_rows(table=self.destination_table, rows=rows)
Note: The snippet is for reference only; it has NOT been tested
References
Pandas convert DataFrame into Array of tuples
Further modifications / improvements
The destination_table param can be read from an Airflow Variable.
If the destination table doesn't necessarily reside in the same Postgres database, then we can take another param like destination_postgres_conn_id in __init__ and use it to create a separate destination_hook on which we invoke the insert_rows method, as sketched below.
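For illustration, here is a minimal sketch of that second variant (untested, like the snippet above); the class name MyCustomCrossDbOperator and the destination_postgres_conn_id argument are made up for this example, and it reuses the imports from the snippet above:

class MyCustomCrossDbOperator(PostgresOperator):

    def __init__(self, destination_table: str, destination_postgres_conn_id: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.destination_table: str = destination_table
        self.destination_postgres_conn_id: str = destination_postgres_conn_id

    def execute(self, context: Dict[str, Any]):
        # source hook: reads via the operator's own postgres_conn_id
        source_hook = PostgresHook(postgres_conn_id=self.postgres_conn_id, schema=self.database)
        df: DataFrame = source_hook.get_pandas_df(sql=self.sql, parameters=self.parameters)
        # ... transformations on df go here ...
        # destination hook: writes via the separate connection
        destination_hook = PostgresHook(postgres_conn_id=self.destination_postgres_conn_id)
        rows: List[Tuple[Any, ...]] = list(df.itertuples(index=False, name=None))
        destination_hook.insert_rows(table=self.destination_table, rows=rows)

Only the write side changes; the read side still uses the operator's own connection.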

Here is a very simple and basic example to read data from a database into a dataframe.
# Get the hook
mysqlserver = MySqlHook("Employees")
# Execute the query
df = mysqlserver.get_pandas_df(sql="select * from employees LIMIT 10")
Kudos to y2k-shubham for the get_pandas_df() tip.
I also save the dataframe to a file to pass it to the next task (this is not recommended when using clusters, since the next task could be executed on a different server).
This full code should work as it is.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.hooks.mysql_hook import MySqlHook

dag_id = "db_test"
args = {
    "owner": "airflow",
}
base_file_path = "dags/files/"


def export_func(task_instance):
    import time

    # Get the hook
    mysqlserver = MySqlHook("Employees")
    # Execute the query
    df = mysqlserver.get_pandas_df(sql="select * from employees LIMIT 10")
    # Generate somewhat unique filename
    path = "{}{}_{}.ftr".format(base_file_path, dag_id, int(time.time()))
    # Save as a binary feather file
    df.to_feather(path)
    print("Export done")
    # Push the path to xcom
    task_instance.xcom_push(key="path", value=path)


def import_func(task_instance):
    import pandas as pd

    # Get the path from xcom
    path = task_instance.xcom_pull(key="path")
    # Read the binary file
    df = pd.read_feather(path)
    print("Import done")
    # Do what you want with the dataframe
    print(df)


with DAG(
    dag_id,
    default_args=args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=["test"],
) as dag:
    export_task = PythonOperator(
        task_id="export_df",
        python_callable=export_func,
    )
    import_task = PythonOperator(
        task_id="import_df",
        python_callable=import_func,
    )

    export_task >> import_task
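If the result set is small, a possible alternative (sketch only) is to skip the intermediate file and pass the dataframe through XCom as JSON; the task id export_small_df below is made up:

def export_small_func():
    # Get the hook (same connection as above)
    mysqlserver = MySqlHook("Employees")
    df = mysqlserver.get_pandas_df(sql="select * from employees LIMIT 10")
    # The return value of a PythonOperator callable is pushed to XCom automatically
    return df.to_json(orient="records")


def import_small_func(task_instance):
    import pandas as pd

    # Pull the JSON payload pushed by the export task and rebuild the dataframe
    payload = task_instance.xcom_pull(task_ids="export_small_df")
    df = pd.read_json(payload, orient="records")
    print(df)

Keep in mind that the default XCom backend stores values in the Airflow metadata database, so this only makes sense for small result sets.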

Related

How do I update values in streamlit on a schedule?

I have a simple streamlit app that is meant to show the latest data. I want it to refresh the data every 5 seconds, but right now the only way I found to do that is via st.experimental_rerun; this is the core code:
import streamlit as st
import time

current_time = int(time.time())
if 'last_run' not in st.session_state:
    st.session_state['last_run'] = current_time

@st.experimental_singleton
def load_data():
    ...
    return data

data = load_data()

if current_time > st.session_state['last_run'] + 5:  # check every 5 seconds
    load_data.clear()  # clear cache
    st.session_state['last_run'] = current_time
    st.experimental_rerun()
However, the st.experimental_rerun() makes the user experience terrible; are there any other ideas?
You can try using schedule and st.empty().
Example:
import time

import streamlit as st
from schedule import every, repeat, run_pending


def load_data():
    ...
    return data


def main():
    data = load_data()
    # Do something with data


with st.empty():
    @repeat(every(5).seconds)
    def refresh_data():
        main()

    while True:
        run_pending()
        time.sleep(1)

Pull list xcoms in TaskGroups not working

My Airflow code has the below PythonOperator callable where I am creating a list and pushing it to XComs:
keys = []
values = []

def attribute_count_check(e_run_id, **context):
    job_run_id = int(e_run_id)
    da = "select count (distinct row_num) from dds_metadata.dds_temp_att_table where run_id ={}".format(job_run_id)
    cursor.execute(da)
    res = cursor.fetchall()
    view_res = [x for res in res for x in res]
    count_of_sql = view_res[0]
    print(count_of_sql)
    if count_of_sql < 1:
        print("deleting of cluster")
        return 'delete_cluster'
    else:
        print("triggering attr_check")
        num_attributes_per_task = num_attr  # job_config
        diff = math.ceil(count_of_sql / num_attributes_per_task)
        instance = int(diff)
        n = num_attributes_per_task
        global values
        global keys
        for r in range(1, instance + 1):
            # a = r
            keys.append(r)
            lower_ranges = (n * (r - 1)) + 1
            upper_range = (n * (r - 1)) + n
            b = (lower_ranges, upper_range)
            values.append(b)
        task_instance = context['task_instance']
        task_instance.xcom_push(key="di_keys", value=keys)
        task_instance.xcom_push(key="di_values", value=values)
The XComs pushed by the job are shown in a screenshot (not included here).
Now I am trying to fetch the values from XComs to create clusters dynamically with the code below:
with TaskGroup('dataproc_create_cluster', prefix_group_id=False) as dataproc_create_clusters:
    for i in zip('{{ ti.xcom_pull(key="di_keys")}}', '{{ ti.xcom_pull(key="di_values")}}'):
        dynmaic_create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster_{}".format(list(eval(str(i)))[0]),
            project_id='{0}'.format(PROJECT),
            cluster_config=CLUSTER_GENERATOR_CONFIG,
            region='{0}'.format(REGION),
            cluster_name="dataproc-cluster-{}-sit".format(str(i[0])),
        )
But I am getting the below error:
Broken DAG: [/opt/airflow/dags/Cluster_config.py] Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models/baseoperator.py", line 547, in __init__
    validate_key(task_id)
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/helpers.py", line 56, in validate_key
    "dots and underscores exclusively".format(k=k)
airflow.exceptions.AirflowException: The key (create_cluster_{) has to be made of alphanumeric characters, dashes, dots and underscores exclusively
So I changed the task_id as below:
task_id="create_cluster_"+re.sub(r'\W+', '', str(list(eval(str(i)))[0])),
After which I got the below error:
airflow.exceptions.DuplicateTaskIdFound: Task id 'create_cluster_' has already been added to the DAG
This made me think that the value in XComs is being parsed one character at a time, so I set render_template_as_native_obj=True.
But I am still getting the duplicate task id error.
Regarding the jinja2 templating outside of templated fields
First, you can only use jinja2 templating in templated fields. Simply put, there are two processes: one is parsing the DAG (which happens first), the other is executing the tasks. At the moment your DAG is parsed, no tasks have run yet, so there is no TaskInstance available and thus no XCom pull available either. With templated fields, however, you can use jinja2 templating because the values of those fields are computed at the moment your task executes; at that point the TaskInstance and the XCom pull are available.
For example, in a PythonOperator you can use the following templated fields:
template_fields: Sequence[str] = ('templates_dict', 'op_args', 'op_kwargs')
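So one way to hand the XCom values to a callable is to pass the template string through one of those fields; a rough sketch reusing the di_keys key from your code (the task id and function name are made up):

from airflow.operators.python import PythonOperator


def use_keys(keys, **context):
    # By the time this runs, the template has been rendered; with
    # render_template_as_native_obj=True on the DAG, keys is a real list,
    # otherwise it is the string representation of the list.
    print(keys)


use_xcom = PythonOperator(
    task_id="use_xcom",
    python_callable=use_keys,
    op_kwargs={"keys": "{{ ti.xcom_pull(key='di_keys') }}"},
)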
Changing the number of tasks based on the result of a task
Second, you cannot change the number of tasks a DAG contains based on the output of a task; Airflow simply does not support this. There is one exception, which is using dynamically mapped tasks. There is a nice example in the docs that I copied here:
from datetime import datetime

from airflow import DAG
from airflow.decorators import task


@task
def make_list():
    # This can also be from an API call, checking a database -- almost anything you like, as long as the
    # resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]


@task
def consumer(arg):
    print(list(arg))


with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())
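Applied to your case, a rough sketch (assuming Airflow 2.3+; the make_cluster_names task is a hypothetical replacement for the xcom_push calls, and PROJECT, REGION and CLUSTER_GENERATOR_CONFIG are taken from your code):

from airflow.decorators import task
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator


@task
def make_cluster_names():
    # derive one cluster name per chunk, e.g. from count_of_sql / num_attributes_per_task
    return ["dataproc-cluster-{}-sit".format(r) for r in range(1, 4)]


DataprocCreateClusterOperator.partial(
    task_id="create_cluster",
    project_id=PROJECT,
    region=REGION,
    cluster_config=CLUSTER_GENERATOR_CONFIG,
).expand(cluster_name=make_cluster_names())

Each mapped instance shares the single task_id create_cluster and is distinguished by its map index, so there is no need to build unique task ids out of XCom values.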

How to create dynamic task mapping for postgres operator in airflow?

I am completely new to Airflow and I am trying to create 8 tasks which are pretty similar.
I've read about the expand() method, though I am not quite sure how to use it with PostgresOperator.
So I have this task:
t1 = PostgresOperator(
    task_id='load_something_1',
    postgres_conn_id="postgres_default",
    sql="SELECT somefunction_1()",
    dag=dag)
I need to create similar tasks, only they should have load_something_2, load_something_3, etc. and
SELECT somefunction_2(), SELECT somefunction_3(), etc.
How do I do this using dynamic task mapping?
Thanks in advance!
It's hard to say whether you need expand() or not without knowing what your iterator looks like, and how the data is made available to the DAG, but here's how this could be accomplished with a simple iterator in a full-example DAG:
from datetime import datetime

from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.decorators import dag, task


@dag(
    default_args={
        'owner': 'me'
    },
    dag_id='example-dag',
    start_date=datetime(2023, 1, 6),
    schedule_interval=None,
)
def workflow():

    @task
    def load_something(i):
        t1 = PostgresOperator(
            task_id='load_something',
            postgres_conn_id="postgres_default",
            sql=f"SELECT somefunction_{i}()",
        )

    my_tasks = [load_something(i) for i in range(1, 9)]
    # my_tasks = [load_something.override(task_id=f'load_something_{i}')(i) for i in range(1, 9)]
    my_tasks


workflow()
Note: just calling your task like my_tasks = [load_something(i) for i in range(1, 9)] with the @task decorator will automatically enumerate the task names for you; if you want to explicitly name the tasks, you can do so using the override() method. Uncomment my_tasks = [load_something.override(task_id=f'load_something_{i}')(i) for i in range(1, 9)].
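If you would rather map the classic operator directly instead of wrapping it in a @task, here is a sketch using expand() (assuming Airflow 2.3+):

load_tasks = PostgresOperator.partial(
    task_id="load_something",
    postgres_conn_id="postgres_default",
).expand(sql=[f"SELECT somefunction_{i}()" for i in range(1, 9)])

This creates eight mapped instances under the single task_id load_something, distinguished by their map indexes, rather than eight separately named tasks.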

combine BranchPythonOperator and PythonVirtualenvOperator

I have a PythonVirtualenvOperator which reads some data from a database. If there is no new data, the DAG should end there; otherwise it should call additional tasks, e.g.
# dag.py
load_data >> [if_data, if_no_data] >> another_task >> last_task
I understand that it can be done using BranchPythonOperator, but I can't see how I can combine the venv and the branch operator.
Is it doable?
This can be solved using XCom.
load_data can push the number of records it processed (new data).
Your pipeline can be:
def choose(**context):
    value = context['ti'].xcom_pull(task_ids='load_data')
    if int(value) > 0:
        return 'if_data'
    return 'if_no_data'


branch = BranchPythonOperator(
    task_id='branch_task',
    provide_context=True,  # Remove this line if Airflow>=2.0.0
    python_callable=choose)

load_data >> branch >> [if_data, if_no_data] >> another_task >> last_task
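For the virtualenv side, the return value of the PythonVirtualenvOperator callable is pushed to XCom (under return_value) just like with a plain PythonOperator, so load_data only needs to return the record count. A sketch with a made-up query and connection string, using the Airflow 2 import path:

from airflow.operators.python import PythonVirtualenvOperator


def load_data_callable():
    # runs inside the virtualenv, so import the venv-only dependencies here
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host/db")  # placeholder connection string
    df = pd.read_sql("SELECT * FROM source_table WHERE is_new", engine)  # placeholder query
    # ... load / process the new data ...
    return len(df)  # pushed to XCom and read by choose() above


load_data = PythonVirtualenvOperator(
    task_id='load_data',
    python_callable=load_data_callable,
    requirements=["pandas", "sqlalchemy", "psycopg2-binary"],
)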

Airflow: How to push xcom value from BigQueryOperator?

This is my operator:
bigquery_check_op = BigQueryOperator(
    task_id='bigquery_check',
    bql=SQL_QUERY,
    use_legacy_sql=False,
    bigquery_conn_id=CONNECTION_ID,
    trigger_rule='all_success',
    xcom_push=True,
    dag=dag
)
When I check the Rendered page in the UI, nothing appears there.
When I run the SQL in the console it returns the value 1400, which is correct.
Why doesn't the operator push the XCom?
I can't use BigQueryValueCheckOperator. That operator is designed to FAIL a check against a value. I don't want anything to fail. I simply want to branch the code based on the return value from the query.
Here is how you might be able to accomplish this with the BigQueryHook and the BranchPythonOperator:
from airflow.operators.python_operator import BranchPythonOperator
from airflow.contrib.hooks.bigquery_hook import BigQueryHook


def big_query_check(**context):
    sql = context['templates_dict']['sql']
    bq = BigQueryHook(bigquery_conn_id='default_gcp_connection_id',
                      use_legacy_sql=False)
    conn = bq.get_conn()
    cursor = conn.cursor()
    cursor.execute(sql)
    results = cursor.fetchone()
    # Do something with results, return task_id to branch to
    if results[0] == 0:
        return "task_a"
    else:
        return "task_b"


sql = "SELECT COUNT(*) FROM sales"

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=big_query_check,
    provide_context=True,
    templates_dict={"sql": sql},
    dag=dag,
)
First we create a python callable that we can use to execute the query and select which task_id to branch to. Second, we create the BranchPythonOperator.
The simplest answer is that xcom_push is not one of the params of BigQueryOperator, BaseOperator, or LoggingMixin.
The BigQueryGetDataOperator does return (and thus push) some data, but it works by table and column name. You could chain this behavior by making the query you run write its output to a uniquely named table (maybe use {{ ds_nodash }} in the name), then use that table as the source for this operator, and then branch with the BranchPythonOperator, as sketched below.
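A sketch of that chaining idea, with made-up dataset/table names (parameter names follow the older contrib operators used in the question):

write_result = BigQueryOperator(
    task_id='bigquery_check_to_table',
    bql=SQL_QUERY,
    use_legacy_sql=False,
    destination_dataset_table='my_dataset.check_result_{{ ds_nodash }}',
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id=CONNECTION_ID,
    dag=dag,
)

get_result = BigQueryGetDataOperator(
    task_id='get_check_result',
    dataset_id='my_dataset',
    table_id='check_result_{{ ds_nodash }}',
    bigquery_conn_id=CONNECTION_ID,
    dag=dag,
)

write_result >> get_result  # get_result pushes the fetched rows to XCom for the branch callable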
You might instead try to use the BigQueryHook's get_conn().cursor() to run the query and work with some data inside the BranchPythonOperator.
Elsewhere we chatted and came up with something along the lines of this for putting in the callable of a BranchPythonOperator:
cursor = BigQueryHook(bigquery_conn_id='connection_name').get_conn().cursor()
# one of these two:
cursor.execute(SQL_QUERY)  # if non-legacy
cursor.job_id = cursor.run_query(bql=SQL_QUERY, use_legacy_sql=False)  # if legacy
result = cursor.fetchone()
return "task_one" if result[0] == 1400 else "task_two"  # depends on results format
