I am new to Airflow. My requirement is to develop a data pipeline that does the following:
i) Connect to a given SFTP folder and download multiple CSV files (filenames starting with "airflow") to S3
ii) Ingest the files from S3 into Redshift
I have used SFTPToS3Operator and S3ToRedshiftOperator for this, and it works perfectly fine for one file when I provide the full file name for sftp_path and s3_key,
but it fails with an IO error if I provide a wildcard (airflow*.csv) as the file name in sftp_path and s3_key.
I need your expert opinion on an approach for downloading multiple files from SFTP to S3 and ingesting them into Redshift.
I am using the code below, which works fine for a single file:
# DAG initialization with required parameters
with DAG("sftp_to_redshift_data_ingestion", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", default_args=default_args, catchup=False) as dag:

    # SFTP to S3 data transfer
    sftp_to_s3 = SFTPToS3Operator(
        task_id="sftp_to_s3",
        sftp_conn_id="sftp_conn",
        sftp_path="/Inbound/airflow_demo.csv",
        s3_conn_id="s3_dev_conn",
        s3_bucket="s3_dev_bucket",
        s3_key="input/test/airflow_demo.csv"
    )
    # S3 to Redshift data transfer
    transfer_s3_to_redshift = S3ToRedshiftOperator(
        task_id="transfer_s3_to_redshift",
        aws_conn_id="s3_dev_conn",
        redshift_conn_id="redshift_dev_conn",
        s3_bucket="s3_dev_bucket",
        s3_key="input/test/airflow_demo.csv",
        schema="test",
        table="airflow_demo_table",
        column_list=["customer_id", "customer_name", "customer_address"],
        copy_options=["csv", "DELIMITER ','", "FILLRECORD", "IGNOREHEADER 1", "COMPROWS 1000000"]
    )
The SFTPToS3Operator only copies over one file at a time, so you first need to get a list of all the file names (metadata) from SFTP.
Then loop through the files and create one SFTPToS3Operator per file; together they will copy all the files into S3.
For example:
files = ["file1.csv", "file2.csv"]

for file in files:
    sftp_to_s3 = SFTPToS3Operator(
        task_id=f"sftp_to_s3_{file}",  # task_id must be unique per file
        sftp_conn_id="sftp_conn",
        sftp_path=f"/Inbound/{file}",
        s3_conn_id="s3_dev_conn",
        s3_bucket="s3_dev_bucket",
        s3_key=f"input/test/{file}"
    )
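To build that file list from the SFTP server itself instead of hard-coding it, here is a minimal sketch using SFTPHook and fnmatch. The connection ids, the /Inbound path and the airflow*.csv pattern are the ones from the question; the helper function is my own, and the hook argument name (ssh_conn_id) should be checked against your installed sftp provider version:

from fnmatch import fnmatch

from airflow.providers.sftp.hooks.sftp import SFTPHook
from airflow.providers.amazon.aws.transfers.sftp_to_s3 import SFTPToS3Operator


def matching_files(pattern="airflow*.csv", path="/Inbound"):
    """Return the remote file names that match the wildcard pattern."""
    hook = SFTPHook(ssh_conn_id="sftp_conn")  # older provider versions use ftp_conn_id
    return [name for name in hook.list_directory(path) if fnmatch(name, pattern)]


for file in matching_files():
    sftp_to_s3 = SFTPToS3Operator(
        task_id=f"sftp_to_s3_{file}",   # one task per matched file
        sftp_conn_id="sftp_conn",
        sftp_path=f"/Inbound/{file}",
        s3_conn_id="s3_dev_conn",
        s3_bucket="s3_dev_bucket",
        s3_key=f"input/test/{file}",
    )

Note that this lists the SFTP directory every time the scheduler parses the DAG file, which may or may not be acceptable in your environment.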
I'm new to Snakemake and I am trying to use specific files in a rule, taken from the directory() output of another rule that clones a git repo.
Currently, this gives me the error "Wildcards in input files cannot be determined from output files: 'json_file'", and I don't understand why. I have previously worked through the tutorial at https://carpentries-incubator.github.io/workflows-snakemake/index.html.
The difference between my workflow and the tutorial workflow is that I want to create the data I use later in the first step, whereas in the tutorial the data was already there.
Workflow description in plain text:
Clone a git repository to path {path}
Run a script {script} on every single JSON file in the directory {path}/parsed/ in parallel to produce the aggregate result {result}
GIT_PATH = config['git_local_path'] # git/
PARSED_JSON_PATH = f'{GIT_PATH}parsed/'
GIT_URL = config['git_url']
# A single parsed JSON file
PARSED_JSON_FILE = f'{PARSED_JSON_PATH}{{json_file}}.json'
# Build a list of parsed JSON file names
PARSED_JSON_FILE_NAMES = glob_wildcards(PARSED_JSON_FILE).json_file
# All parsed JSON files
ALL_PARSED_JSONS = expand(PARSED_JSON_FILE, json_file=PARSED_JSON_FILE_NAMES)
rule all:
    input: 'result.json'

rule clone_git:
    output: directory(GIT_PATH)
    threads: 1
    conda: f'{ENVS_DIR}git.yml'
    shell: f'git clone --depth 1 {GIT_URL} {{output}}'

rule extract_json:
    input:
        cmd='scripts/extract_json.py',
        json_file=PARSED_JSON_FILE
    output: 'result.json'
    threads: 50
    shell: 'python {input.cmd} {input.json_file} {output}'
Running only clone_git works fine (if I set an all input of GIT_PATH).
Why do I get the error message? Is this because the JSON files don't exist when the workflow is started?
Also - I don't know if this matters - this is a subworkflow used with module.
What you need seems to be a checkpoint rule: it is executed first, and only afterwards does Snakemake determine which .json files are present and run your extract/aggregate rules. Here's an adapted example.
I'm struggling to fully understand the file and folder structure you get after cloning your git repo, so I have fallen back to the Snakemake best practice of using resources for downloaded files and results for created files.
You'll need to re-adjust those paths to match your case:
GIT_PATH = config["git_local_path"] # git/
GIT_URL = config["git_url"]
checkpoint clone_git:
    output:
        git=directory(GIT_PATH),
    threads: 1
    conda:
        f"{ENVS_DIR}git.yml"
    shell:
        f"git clone --depth 1 {GIT_URL} {{output.git}}"


rule extract_json:
    input:
        cmd="scripts/extract_json.py",
        json_file="resources/{file_name}.json",
    output:
        "results/parsed_files/{file_name}.json",
    shell:
        "python {input.cmd} {input.json_file} {output}"


def get_all_json_file_names(wildcards):
    git_dir = checkpoints.clone_git.get(**wildcards).output["git"]
    file_names = glob_wildcards(
        "resources/{file_name}.json"
    ).file_name
    return expand(
        "results/parsed_files/{file_name}.json",
        file_name=file_names,
    )


# Rule has a checkpoint dependency: only after the checkpoint has executed
# is this rule evaluated, which then calls the function to determine all
# json files downloaded from the git repo
rule aggregate:
    input:
        get_all_json_file_names
    output:
        "result.json",
    default_target: True
    shell:
        # TODO: Action which combines all JSON files
edit: Moved the expand(...) from rule aggregate into get_all_json_file_names.
I have a project in Google Cloud Composer that is meant to send a report on a daily basis.
The code below does that, and it works fine.
with models.DAG('reporte_prueba',
                schedule_interval=datetime.timedelta(weeks=4),
                default_args=default_dag_args) as dag:

    make_bq_dataset = bash_operator.BashOperator(
        task_id='make_bq_dataset',
        # Executing 'bq' command requires Google Cloud SDK which comes
        # preinstalled in Cloud Composer.
        bash_command='bq ls {} || bq mk {}'.format(
            bq_dataset_name, bq_dataset_name))

    bq_audit_query = bigquery_operator.BigQueryOperator(
        task_id='bq_audit_query',
        sql=query_sql,
        use_legacy_sql=False,
        destination_dataset_table=bq_destination_table_name)

    export_audits_to_gcs = bigquery_to_gcs.BigQueryToCloudStorageOperator(
        task_id='export_audits_to_gcs',
        source_project_dataset_table=bq_destination_table_name,
        destination_cloud_storage_uris=[output_file],
        export_format='CSV')

    download_file = GCSToLocalFilesystemOperator(
        task_id="download_file",
        object_name='audits.csv',
        bucket='bucket-reportes',
        filename='/home/airflow/gcs/data/audits.csv',
    )

    email_summary = email_operator.EmailOperator(
        task_id='email_summary',
        to=['aa@bb.cl'],
        subject="""Reporte de Auditorías Diarias
        Institución: {institution_report} día {date_report}
        """.format(date_report=date, institution_report=institution),
        html_content="""
        Sres.
        <br>
        Adjunto enviamos archivo con Reporte Transacciones Diarias.
        <br>
        """,
        files=['/home/airflow/gcs/data/audits.csv'])

    delete_bq_table = bash_operator.BashOperator(
        task_id='delete_bq_table',
        bash_command='bq rm -f %s' % bq_destination_table_name,
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE)

    (
        make_bq_dataset
        >> bq_audit_query
        >> export_audits_to_gcs
        >> delete_bq_table
    )

    export_audits_to_gcs >> download_file >> email_summary
With this code, I create a table (which is later deleted) containing the data that I need to send, then export that table to storage as a CSV.
Then I download the .csv to the local Airflow directory to send it by mail.
The question I have is whether I can avoid creating the table and exporting it to storage, since I don't need them.
For example, could I execute the query with BigQueryOperator and access the result in Airflow, generating the CSV locally and then sending it?
I have a way to generate the CSV, but my biggest doubt is how (if it is possible) to access the result of the query or pass the result to another Airflow task.
Though I wouldn't recommend passing the results of SQL queries across tasks, XComs are what Airflow generally uses for communication between tasks:
https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
Also, you would need to create a custom operator to return query results, as I believe BigQueryOperator doesn't return them.
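As a rough sketch of that idea (not a drop-in implementation): run the query yourself in a PythonOperator via BigQueryHook, write the CSV to the same local path the EmailOperator already attaches, and only push small values to XCom. The import paths below are the Airflow 2 provider ones and may differ on older Composer images; query_sql and email_summary are the variable and task from the DAG above.

from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook


def query_to_csv(**context):
    # Run the query directly and keep the result in memory as a DataFrame.
    hook = BigQueryHook(use_legacy_sql=False)  # uses the google_cloud_default connection
    df = hook.get_pandas_df(query_sql)
    # Write the CSV where the EmailOperator already looks for its attachment.
    df.to_csv('/home/airflow/gcs/data/audits.csv', index=False)
    # The return value is pushed to XCom; keep it small (a row count, not the data).
    return len(df)


query_and_export = PythonOperator(
    task_id='query_and_export',
    python_callable=query_to_csv,
)

query_and_export >> email_summary

This skips the intermediate table and the GCS export entirely, at the cost of the query result having to fit in the worker's memory.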
I have a main DAG which retrieves a file and splits the data in this file into separate CSV files.
I have another set of tasks that must be done for each of these CSV files, e.g. uploading to GCS and inserting into BigQuery.
How can I generate a SubDAG for each file dynamically, based on the number of files? (The SubDAG will define tasks like uploading to GCS, inserting into BigQuery, and deleting the CSV file.)
So right now, this is what it looks like:
main_dag = DAG(....)
download_operator = SFTPOperator(dag = main_dag, ...) # downloads file
transform_operator = PythonOperator(dag = main_dag, ...) # Splits data and writes csv files
def subdag_factory(): # Will return a subdag with tasks for uploading to GCS, inserting to BigQuery.
...
...
How can I call the subdag_factory for each file generated in transform_operator?
I tried creating subdags dynamically as follows
# create and return a DAG
def create_subdag(dag_parent, dag_id_child_prefix, db_name):
    # dag params
    dag_id_child = '%s.%s' % (dag_parent.dag_id, dag_id_child_prefix + db_name)
    default_args_copy = default_args.copy()

    # dag
    dag = DAG(dag_id=dag_id_child,
              default_args=default_args_copy,
              schedule_interval='@once')

    # operators
    tid_check = 'check2_db_' + db_name
    py_op_check = PythonOperator(task_id=tid_check, dag=dag,
                                 python_callable=check_sync_enabled,
                                 op_args=[db_name])

    tid_spark = 'spark2_submit_' + db_name
    py_op_spark = PythonOperator(task_id=tid_spark, dag=dag,
                                 python_callable=spark_submit,
                                 op_args=[db_name])

    py_op_check >> py_op_spark
    return dag

# wrap DAG into SubDagOperator
def create_subdag_operator(dag_parent, db_name):
    tid_subdag = 'subdag_' + db_name
    # tid_prefix_subdag (e.g. 'subdag_') is assumed to be defined alongside default_args
    subdag = create_subdag(dag_parent, tid_prefix_subdag, db_name)
    sd_op = SubDagOperator(task_id=tid_subdag, dag=dag_parent, subdag=subdag)
    return sd_op

# create SubDagOperator for each db in db_names
def create_all_subdag_operators(dag_parent, db_names):
    subdags = [create_subdag_operator(dag_parent, db_name) for db_name in db_names]
    # chain subdag-operators together
    airflow.utils.helpers.chain(*subdags)
    return subdags

# (top-level) DAG & operators
dag = DAG(dag_id=dag_id_parent,
          default_args=default_args,
          schedule_interval=None)

subdag_ops = create_all_subdag_operators(dag, db_names)
Note that the list of inputs for which SubDAGs are created, here db_names, can either be declared statically in the Python file or read from an external source, for example as sketched below.
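A small sketch of the "external source" option (the file path and Variable name here are hypothetical, not from the original answer): db_names could be loaded from a JSON file deployed with the DAG, or from an Airflow Variable maintained in the UI.

import json

from airflow.models import Variable

# Option 1: a JSON config file shipped alongside the DAG (hypothetical path)
with open('/path/to/db_names.json') as f:
    db_names = json.load(f)

# Option 2: an Airflow Variable, edited from the UI
db_names = Variable.get('db_names', deserialize_json=True, default_var=[])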
The resulting DAG looks like this
Diving into SubDAG(s)
Airflow deals with dynamic DAGs in two different ways.
One way is to define your dynamic DAG in one Python file and put it into the dags_folder, generating the DAGs from an external source (config files in another directory, SQL, NoSQL, etc.). The fewer changes to the structure of the DAG, the better (this is actually true in all situations). For instance, our DAG file generates a DAG for every record (or file), including its dag_id; on every scheduler heartbeat this code goes through the list and generates the corresponding DAGs. Pros: not many, just one code file to change. Cons: plenty, and they come down to the way Airflow works. For every new DAG (dag_id), Airflow writes its steps into the database, so when the number of steps or the name of a step changes, it might break the web server. When you delete a DAG from your list it becomes a kind of orphan: you can't access it from the web interface and have no control over it; you can't see its steps, you can't restart it, and so on. If you have a static list of DAGs whose IDs are not going to change but whose steps occasionally do, this method is acceptable.
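A minimal sketch of this first approach, with illustrative names (build_dag, the glob pattern and the task inside are my own placeholders, not from the answer, and the imports use Airflow 2 paths): the DAG file scans an external source on every parse and registers one DAG per item in globals(), which is how Airflow discovers them.

import glob
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_dag(dag_id, file_path):
    dag = DAG(dag_id=dag_id, start_date=datetime(2021, 1, 1), schedule_interval=None)
    with dag:
        PythonOperator(
            task_id='process_file',
            python_callable=lambda: print(f'processing {file_path}'),
        )
    return dag


# External source of truth: here a directory, but it could be a config file or a DB.
for path in glob.glob('/data/incoming/*.csv'):
    dag_id = 'process_' + os.path.basename(path).replace('.', '_')
    globals()[dag_id] = build_dag(dag_id, path)  # registering in globals() exposes the DAG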
So at some point I came up with another solution. You have static DAGs (they are still dynamic in the sense that a script generates them, but their structure and IDs do not change). Instead of one script that walks through a list, like a directory, and generates DAGs, you create two static DAGs: one monitors the directory periodically (*/10 * * * *), and the other one is triggered by the first. When a new file (or files) appears, the first DAG triggers the second one with a conf argument. The following code has to be executed for every file in the directory:
import logging
from datetime import datetime

from airflow import settings
from airflow.models import DagRun

session = settings.Session()
dr = DagRun(
    dag_id=dag_to_be_triggered,
    run_id=uuid_run_id,
    conf={'file_path': path_to_the_file},
    execution_date=datetime.now(),
    start_date=datetime.now(),
    external_trigger=True)
logging.info("Creating DagRun {}".format(dr))
session.add(dr)
session.commit()
session.close()
The triggered DAG can receive the conf arg and finish all the required tasks for the particular file. To access the conf param use this:
def work_with_the_file(**context):
    path_to_file = context['dag_run'].conf['file_path'] \
        if 'file_path' in context['dag_run'].conf else None

    if not path_to_file:
        raise Exception('path_to_file must be provided')
Pros: all the flexibility and functionality of Airflow.
Cons: the monitor DAG can be spammy.
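To tie the pieces together, here is a sketch of the monitor DAG's callable (my own wiring: the watched directory is hypothetical and the in-memory "seen" set should be persisted, e.g. in a db table or Variable, in a real setup). It lists the folder and creates one DagRun with the file path in conf for every new file.

import os
import uuid
from datetime import datetime

from airflow import settings
from airflow.models import DagRun

WATCH_DIR = '/data/incoming'   # hypothetical watched directory
seen_files = set()             # persist this in a real setup


def trigger_per_file(**context):
    session = settings.Session()
    for name in os.listdir(WATCH_DIR):
        path = os.path.join(WATCH_DIR, name)
        if path in seen_files:
            continue
        seen_files.add(path)
        session.add(DagRun(
            dag_id='dag_to_be_triggered',   # id of the second, triggered DAG
            run_id='triggered__{}'.format(uuid.uuid4()),
            conf={'file_path': path},
            execution_date=datetime.now(),
            start_date=datetime.now(),
            external_trigger=True))
    session.commit()
    session.close()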
I set up an Airflow server successfully. I want to run some test jobs but I am having trouble finding beginner guides which fit into what I am trying to do.
Current status:
Python scripts to download files from SFTP (any file which does not exist on local machine) or create a file from a queryout
Pandas scripts to read the data into memory, modify it in some way to prepare it for the database (look for new dimensions, remap, add calculations). Load data to appropriate table in database. Send email summaries (pandas to_html)
The logic I have for most of my scripts is: if the file has not been processed, then process it. 'Processed' files are either tracked by filename in a db table, or moved to a special processed folder.
The other logic I have is based on the date in the filename. I compare the dates of the files that exist against the dates that should exist (a range of dates). If a file does not exist, I create it (usually with a BCP or PSQL query).
Do I just have Airflow run these .py files? Or should I alter my scripts to use some of the Airflow parameters / Jinja templating?
I almost feel like I could use the BashOperator for almost everything. Would this work?
dag_input = sys.argv[1]

def alter_table(query, engine=pg_engine):
    fake_conn = engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.execute(query)
    fake_conn.commit()
    fake_cur.close()

query_list = [
    f'SELECT * from table_1 where report_date = \'{dag_input}\'',
    f'SELECT * from table_2 where report_date = \'{dag_input}\'',
]

for value in query_list:
    alter_table(value)
Then the DAG would be something like this, with an Airflow parameter used for the sys.argv?
templated_command = """
    python download_raw.py "{{ ds }}"
"""

t3 = BashOperator(
    task_id='download_raw',
    bash_command=templated_command,
    dag=dag)
Since the code for this task is in Python, I would use a PythonOperator.
Put a method in download_raw.py that takes **kwargs as parameters, and you have access to everything in the context.
from download_raw import my_func

t3 = PythonOperator(
    task_id='download_raw',
    python_callable=my_func,
    # provide_context=True,  # required on Airflow 1.x; Airflow 2 passes the context automatically
    dag=dag)

# inside download_raw.py
def my_func(**kwargs):
    context = kwargs
    ds = context['ds']
    ...  # (do your logic here)
I would do it like this; otherwise your bash command could get hideous once you need several pieces of the context.
Is it possible to execute multiple queries at the same time in Impala? If yes, how does Impala handle it?
I would certainly run some tests of your own, but I was not able to get multiple queries to execute in a single call.
I was using an Impala connection and reading the query from a .sql file; this works for single commands.
from impala.dbapi import connect

# actual server and port changed for this post for security
conn = connect(host='impala server', port=11111, auth_mechanism="GSSAPI")
cursor = conn.cursor()
cursor.execute((open("sandbox/z_temp.sql").read()))
This is the error I received.
HiveServer2Error: AnalysisException: Syntax error in line 2:
This is what the SQL looked like in the .sql file.
Select * FROM database1.table1;
Select * FROM database1.table2;
I was able to run multiple commands by putting the SQL commands in separate .sql files and iterating over all .sql files in a specified folder.
import glob
import pandas as pd

# Create a list of file names for the recon .sql files; this list will be sorted.
# Numbers at the beginning of each filename are important so the files execute in the correct order.
file_names = glob.glob('folder/*.sql')
asc_names = sorted(file_names, reverse=False)

# error log dataframe to print, or write to file at the end of the job
df_log = pd.DataFrame(columns=['test_name', 'test_status'])

for file_name in asc_names:
    str_filename = str(file_name)
    print(str_filename)
    query = (open(str_filename).read())
    cursor = conn.cursor()
    try:
        # Each SQL command must be executed separately
        cursor.execute(query)
        df_id = pd.DataFrame([{'test_name': str_filename[-40:], 'test_status': 'PASS'}])
        df_log = df_log.append(df_id, ignore_index=True)
    except:
        df_id = pd.DataFrame([{'test_name': str_filename[-40:], 'test_status': 'FAIL'}])
        df_log = df_log.append(df_id, ignore_index=True)
        continue
Another way to do this would be to have all of the SQL statements in one .sql file separated by semicolons, then loop through the file, splitting the statements on ';' and running them one at a time.
from impala.dbapi import connect
from impala.util import as_pandas

conn = connect(host='impalaserver', port=11111, auth_mechanism='GSSAPI')
cursor = conn.cursor()

# Split the SQL statements from one file separated by ';'.
# Note: the last command will not have a semicolon at the end.
sql_file = open("sandbox/temp.sql").read()
sql = sql_file.split(';')

for cmd in sql:
    # This gets rid of the non-printing characters you may have
    cmd = cmd.replace('\r', '')
    cmd = cmd.replace('\n', '')
    # This runs your SQL commands one at a time.
    cursor.execute(cmd)
    print(cmd)
Impala can execute multiple queries at the same time as long as it doesn't hit the memory cap.
You can issue a command like impala-shell -f <<file_name>>, where the file has multiple queries, each complete query separated by a semicolon (;).
If you are a Python geek, you can even try the impyla package to create multiple connections and run all your queries at once:
pip install impyla
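A small sketch of that impyla idea (my own wiring, with placeholder host/port and queries): give each query its own connection and cursor and run them concurrently with a thread pool, since a single cursor only runs one statement at a time.

from concurrent.futures import ThreadPoolExecutor

from impala.dbapi import connect

QUERIES = [
    "SELECT count(*) FROM database1.table1",
    "SELECT count(*) FROM database1.table2",
]


def run_query(sql):
    # One connection per thread; cursors are not shared across threads.
    conn = connect(host='impala-server', port=21050)
    cursor = conn.cursor()
    cursor.execute(sql)
    rows = cursor.fetchall()
    conn.close()
    return rows


with ThreadPoolExecutor(max_workers=len(QUERIES)) as pool:
    results = list(pool.map(run_query, QUERIES))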