I'm new to Google Dataflow. I have 2 Dataflow pipelines that execute 2 different jobs: one is an ETL process that loads into BigQuery, and the other reads from BigQuery and aggregates the data for a report.
I want to run the ETL pipeline first, and after it completes, run the reports pipeline, to make sure the data in BigQuery is up to date.
I tried to run them in one pipeline, but it didn't help. For now I have to run the ETL manually first and then run the report pipeline.
Can anybody give me some advice on running the 2 jobs in one pipeline?
Thanks.
You should be able to do both of these in a single pipeline. Rather than writing to BigQuery and then trying to read that back in and generate the report, consider just using the intermediate data for both purposes. For example:
PCollection<Input> input = /* ... */;
// Perform your transformation logic
PCollection<Intermediate> intermediate = input
.apply(...)
.apply(...);
// Convert the transformed results into table rows and
// write those to BigQuery.
intermediate
.apply(ParDo.of(new IntermediateToTableRowETL()))
.apply(BigQueryIO.write(...));
// Generate your report over the transformed data
intermediate
.apply(...)
.apply(...);
Context: We store historical data in Azure Data Lake as versioned parquet files from our existing Databricks pipeline where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer on the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that I download the entire folder when I search it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage? Whether during external table creation or using an operator at query time.
Background: The reason I ask is that in Databricks it is possible to use the SELECT statement to fetch only the columns I'm interested in. This reduces the query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there's a bug with the "Downloaded Size" metric - it presents the total data size, regardless of the selected columns. We'll fix it. Thanks for reporting.
I have a project in BQ that consists of sending reports in CSV format on a daily basis.
These reports are the results of queries to BigQuery, which are then compressed to CSV and mailed.
I used the implementation from the following question to solve my problem:
How to run a BigQuery query and then send the output CSV to Google Cloud Storage in Apache Airflow?
Now, I am trying to change that implementation.
The reasons are:
The main reason is that I don't like the idea of creating temporary tables to export the result. I have not found an operator that exports the result of a query directly.
I don't need to take the data to Storage, especially if I'm going to download it to the local Airflow directory anyway.
I tried using "get_pandas_df" from bigquery_hook and then passing the result through XCom to another task that would be in charge of compressing it to CSV. Due to the heaviness of the DataFrame it was not possible.
Do you have any idea how to do it directly?
In Airflow it is equally easy to use existing operators as well as write your own operators. This is all Python. Airflow has a two-layered approach for external services - it has Operators (where each operator does a single operation) and Hooks (which are a super-easy-to-use interface providing an API to communicate with external services).
In your case, rather than composing the existing operators, you should create your own operator using multiple hooks: one Hook to read the data into a pandas DataFrame, for example, then a bit of Python code to extract the data in a form that you can attach to the mail, and then 'send_email' from airflow.utils to send the email (there is no separate Hook for sending email because sending emails is a standard feature of Airflow Core). You can take a look at the EmailOperator code to see how send_email is used, and at the BigQuery operators to see how to use BigQueryHook.
You can do it in two ways:
Classic - define your own operator as a class (you can do it in your DAG file and use it in your DAG):
class MyOperator(BaseOperator):

    def __init__(self, .....):
        ...

    def execute(self, context):
        bq_hook = BigQueryHook(.....)
        # ... do stuff ...
        send_email(....)
Task Flow API (which is much more Pythonic/functional and has less boilerplate):

@dag(...)
def my_dag():

    @task()
    def read_data_and_send_email():
        bq_hook = BigQueryHook(.....)
        # ... do stuff ...
        send_email(....)
Task Flow is, I think, better for your needs: see http://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html
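For completeness, here is a minimal Task Flow sketch of that idea, assuming Airflow 2 with the Google provider installed; the connection id, query, file path and recipient address are placeholders you would replace with your own:

import gzip
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.utils.email import send_email


@dag(schedule_interval="@daily", start_date=datetime(2021, 1, 1), catchup=False)
def daily_report():

    @task()
    def read_data_and_send_email():
        # Run the query and pull the result into a pandas DataFrame.
        bq_hook = BigQueryHook(gcp_conn_id="bigquery_default", use_legacy_sql=False)
        df = bq_hook.get_pandas_df("SELECT * FROM `my_project.my_dataset.my_table`")

        # Compress the result to CSV on the local worker disk.
        report_path = "/tmp/report.csv.gz"
        with gzip.open(report_path, "wt") as f:
            df.to_csv(f, index=False)

        # send_email is part of Airflow core and uses the SMTP settings
        # configured in airflow.cfg.
        send_email(
            to="reports@example.com",
            subject="Daily BigQuery report",
            html_content="Please find the report attached.",
            files=[report_path],
        )

    read_data_and_send_email()


report_dag = daily_report()

This keeps everything in a single task, so no temporary table and no intermediate upload to Storage is needed.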
I am new to Airflow. I took some courses about it but did not come across any example for my use case. I would like to:
Fetch data from Postgres (the data is usually 1M+ rows, so I assume it is too large for XComs)
(In some cases process the data if needed, but this can usually be done inside the query itself)
Insert the data into Oracle
I tend to see workflows like exporting the data into a CSV first (from Postgres), then loading it into the destination database. However, I feel like it would be best to do all these 3 tasks in a single Python operator (for example looping with a cursor and bulk inserting), but I am not sure if this is suitable for Airflow.
Any ideas on possible solutions to this situation? What is the general approach?
As you mentioned, there are several options.
To name a few:
Doing everything in one Python task.
Creating a custom operator.
Creating a pipeline that works with files.
All approaches are valid; each one has advantages and disadvantages.
First approach:
You can write a Python function that uses PostgresHook to create a DataFrame and then load it into Oracle.
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def all_in_one(**context):
    pg_hook = PostgresHook(postgres_conn_id='postgres_conn')
    df = pg_hook.get_pandas_df('SELECT * FROM table')
    # do some transformation on df as needed and load to oracle


op = PythonOperator(task_id='all_in_one_task',
                    python_callable=all_in_one,
                    dag=dag)
Advantages:
Easy coding (for people who are used to writing Python scripts).
Disadvantages:
Not suitable for large transfers, as everything is held in memory.
If you need to backfill or rerun, the entire function is executed. So if there is an issue with loading into Oracle, you will still rerun the code that fetches the records from PostgreSQL.
Second approach:
You can implement your own MyPostgresqlToOracleTransfer operator with any logic you wish. This is useful if you want to reuse the functionality in different DAGs; a sketch follows below.
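A minimal sketch of such a transfer operator, assuming Airflow 2 with the Postgres and Oracle providers installed; this is not an existing Airflow operator, and the connection ids, chunk size and argument names are placeholders:

from airflow.models import BaseOperator
from airflow.providers.oracle.hooks.oracle import OracleHook
from airflow.providers.postgres.hooks.postgres import PostgresHook


class MyPostgresqlToOracleTransfer(BaseOperator):

    def __init__(self, *, sql, oracle_table,
                 postgres_conn_id='postgres_default',
                 oracle_conn_id='oracle_default',
                 rows_chunk=5000, **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.oracle_table = oracle_table
        self.postgres_conn_id = postgres_conn_id
        self.oracle_conn_id = oracle_conn_id
        self.rows_chunk = rows_chunk

    def execute(self, context):
        pg_hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        oracle_hook = OracleHook(oracle_conn_id=self.oracle_conn_id)

        # Stream the result set in chunks so the whole table never has
        # to fit in the worker's memory at once.
        with pg_hook.get_conn() as conn:
            with conn.cursor() as cursor:
                cursor.execute(self.sql)
                rows = cursor.fetchmany(self.rows_chunk)
                while rows:
                    oracle_hook.insert_rows(self.oracle_table, rows,
                                            commit_every=self.rows_chunk)
                    rows = cursor.fetchmany(self.rows_chunk)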
Third approach:
Work with files (data-lake like).
The file can be on the local machine if you have only 1 worker; if not, the file must be uploaded to shared storage (S3, Google Cloud Storage, or any other disk that can be accessed by all workers).
A possible pipeline can be:
PostgreSQLToGcs -> GcsToOracle
Depending on what service you are using, some of the required operators may already be implemented by Airflow; a sketch of such a pipeline follows the list of advantages and disadvantages below.
Advantages:
Each task stands on its own, so if you successfully exported the data to disk, then in the event of a backfill or failure you can just re-execute the failed operators and not the whole pipeline. You can also keep the exported files in cold storage in case you need to rebuild from history.
Suitable for large transfers.
Disadvantages:
Adding another service which is "not needed" (the shared disk resource).
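A sketch of that file-based pipeline, assuming a GCS bucket as the shared storage: PostgresToGCSOperator ships with the Google provider, while the load step is written here as a plain Python task because Airflow has no built-in GCS-to-Oracle transfer operator (bucket, table and connection names are placeholders):

import csv
import io
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.providers.oracle.hooks.oracle import OracleHook


def gcs_to_oracle(bucket, object_name, table, **_):
    # Download the exported file from shared storage and bulk-insert it
    # into Oracle. For very large files you would stream it in chunks.
    data = GCSHook().download(bucket_name=bucket, object_name=object_name).decode()
    rows = list(csv.reader(io.StringIO(data)))
    OracleHook(oracle_conn_id='oracle_conn').insert_rows(
        table, rows[1:], target_fields=rows[0])


with DAG('pg_to_oracle', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:

    export = PostgresToGCSOperator(
        task_id='postgres_to_gcs',
        postgres_conn_id='postgres_conn',
        sql='SELECT * FROM my_table',
        bucket='my-bucket',
        filename='exports/{{ ds }}/my_table.csv',
        export_format='csv',
    )

    load = PythonOperator(
        task_id='gcs_to_oracle',
        python_callable=gcs_to_oracle,
        op_kwargs={'bucket': 'my-bucket',
                   'object_name': 'exports/{{ ds }}/my_table.csv',
                   'table': 'my_table'},
    )

    export >> load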
Summary
I prefer the 2nd/3rd approaches. I think they are more suitable to what Airflow provides and allow more flexibility.
I am learning about Airflow and want to understand why it is not a data streaming solution.
https://airflow.apache.org/docs/apache-airflow/1.10.1/#beyond-the-horizon
Airflow is not a data streaming solution. Tasks do not move data from
one to the other (though tasks can exchange metadata!)
I'm not sure what it means that tasks do not move data from one to another. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from the previous step >> another step that depends on the previous task's result >> ...?
Furthermore, I have read the following a few times and it still doesn't click with me. What's an example of a static vs. dynamic workflow?
Workflows are expected to be mostly static or slowly changing. You can think of the structure of the tasks in your workflow as slightly more dynamic than a database structure would be. Airflow workflows are expected to look similar from a run to the next, this allows for clarity around unit of work and continuity.
Can someone help provide an alternative explanation or example that can walk through why Airflow is not a good data streaming solution?
I'm not sure what it means that tasks do not move data from one to another. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from the previous step >> another step that depends on the previous task's result >> ...?
You can use XCom (cross-communication) to use data from a previous step, BUT bear in mind that XCom is stored in the metadata DB, meaning that you should not use XCom for holding the data that is being processed. You should use XCom for holding a reference to the data (e.g. an S3/HDFS/GCS path, a BigQuery table, etc.).
So instead of passing data between tasks, in Airflow you pass a reference to the data between tasks using XCom. Airflow is an orchestrator; heavy tasks should be off-loaded to another data processing cluster (a Spark cluster, BigQuery, etc.).
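As a minimal sketch of that pattern (the bucket and paths are made up for illustration): the first task returns only a small reference string, which Airflow stores in XCom, and the downstream task hands that reference to whatever engine actually does the processing.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval='@daily', start_date=datetime(2021, 1, 1), catchup=False)
def reference_passing():

    @task()
    def extract(ds=None) -> str:
        # The heavy extraction runs in the external system; the task only
        # returns the location of the result. This return value is what
        # gets stored in XCom.
        return f'gs://my-bucket/extracts/{ds}/data.parquet'

    @task()
    def calculate_a(path: str):
        # The downstream task receives just the small reference string and
        # submits the real work to the processing engine (Spark, BigQuery, ...).
        print(f'Submitting Calculate A over {path}')

    calculate_a(extract())


xcom_dag = reference_passing()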
I have a requirement to get the results of different transaction codes (TCodes); those extracts have to be loaded into an SQL database. Some TCodes have complex logic, so replicating the logic is not an option.
Is there any way to have a daily process that runs all the TCodes and places the extracts in OneDrive or any other location?
I just need the same result as if a user went into the TCode, executed it, and extracted the result to a CSV file.