I'm using Hadoop and working with a map task that creates files that I want to keep, currently I am passing these files through the collector to the reduce task. The reduce task then passes these files on to its collector, this allows me to retain the files.
My question is how do I reliably and efficiently keep the files created by map?
I know I can turn off the automatic deletion of map's output, but that is frowned upon are they any better approaches?
You could split it up into two jobs.
First create a map only job outputting the sequence files you want.
Then, taking your existing job (doing really nothing in the map anymore but you could do some crunching depending on your implementation & use cases) and reducing as you do now inputting the previous map only job through as your input to the second job.
You can wrap this all up in one jar running the 2 jars as such passing the output path as an argument to the second jobs input path.
Related
I am new to airflow, I took some courses about it but did not come across any example for my use case, I would like to:
Fetch data from postgres (data is usually 1m+ rows assuming it is large for xcoms)
(In some cases process the data if needed but this can be usually done inside the query itself)
Insert data into oracle
I tend to see workflows like exporting data into a csv first (from postgres), then loading it into destination database. However, I feel like it would be best to do all these 3 tasks in a single python operator (for example looping with a cursor and bulk inserting) but not sure if this is suitable for airflow.
Any ideas on possible solutions to this situation? What is the general approach?
As you mentioned there are several options.
To name a few:
Doing everything in one python task.
Create a pipeline
Create a custom operator.
All approaches are valid each one has advantages and disadvantages.
First approach:
You can write a python function that uses PostgresqlHook to create a dataframe and then load it to oracle.
def all_in_one(context):
pg_hook = PostgresHook('postgres_conn')
df = pg_hook.get_pandas_df('SELECT * FROM table')
# do some transformation on df as needed and load to oracle
op = PyhtonOperator(task_id='all_in_one_task',
python_callable=all_in_one,
dag=dag
)
Advantages :
easy coding (for people who are used to write python scripts)
Disadvantages:
not suitable for large transfers as it's in memory.
If you need to backfill or rerun the entire function is executed. So if there is an issue with loading to oracle you will still rerun the code that fetch records from PostgreSQL.
Second approach:
You can implement your own MyPostgresqlToOracleTransfer with any logic you wish. This is useful if you want to reuse the functionality in different DAGs
Third approach:
Work with files (data lake like).
the file can be on local machine if you have only 1 worker, if not the file must be uploaded to a shared drive (S3, Google Storage, any other disk that can be accessed by all workers).
Possible pipeline can be:
PostgreSQLToGcs -> GcsToOracle
Depends on what service you are using, some of the required operators may already been implemented by Airflow.
Advantages :
Each task stand for itself thus if you successful exported the data to disk, in event of backfill / failure you can just execute the failed operators and not the whole pipe. You can also save the exported files in cold storage in case you will need to rebuild from history.
Suitable for large transfers.
Disadvantages:
Adding another service which is "not needed" (shared disk resource)
Summary
I prefer the 2nd/3rd approaches. I think it's more suitable to what Airflow provides and allow more flexibility.
In the docs, they say that you should avoid passing data between tasks:
This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called XCom that is described in the section XComs.
I fundamentally don't understand what they mean. If there's no data to pass between two tasks, why are they part of the same DAG?
I've got half a dozen different tasks that take turns editing one file in place, and each send an XML report to a final task that compiles a report of what was done. Airflow wants me to put all of that in one Operator? Then what am I gaining by doing it in Airflow? Or how can I restructure it in an Airflowy way?
fundamentally, each instance of an operator in a DAG is mapped to a different task.
This is a subtle but very important point: in general if two operators need to share
information, like a filename or small amount of data, you should consider combining them
into a single operator
the above sentence means that if you want any information that needs to be shared between two different tasks then it is best you could combine them into one task instead of using two different tasks, on the other hand, if you must use two different tasks and you need to pass some information from one task to another then you can do it using
Airflow's XCOM, which is similar to a key-value store.
In a Data Engineering use case, file schema before processing is important. imagine two tasks as follows :
Files_Exist_Check : the purpose of this task is to check whether particular files exist in a directory or not
before continuing.
Check_Files_Schema: the purpose of this task is to check whether the file schema matches the expected schema or not.
It would only make sense to start your processing if Files_Exist_Check task succeeds. i.e. you have some files to process.
In this case, you can "push" some key to xcom like "file_exists" with the value being the count of files present in that particular directory in Task Files_Exist_Check.
Now, you "pull" this value using the same key in Check_Files_Schema Task, if it returns 0 then there are no files for you to process hence you can raise exception and fail the task or handle gracefully.
hence sharing information across tasks using xcom does come in handy in this case.
you can refer following link for more info :
https://www.astronomer.io/guides/airflow-datastores/
Airflow - How to pass xcom variable into Python function
What you have to do for avoiding having everything in one operator is saving the data somewhere. I don't quite understand your flow, but if for instance, you want to extract data from an API and insert that in a database, you would need to have:
PythonOperator(or BashOperator, whatever) that takes the data from the API and saves it to S3/local file/Google Drive/Azure Storage...
SqlRelated operator that takes the data from the storage and insert it into the database
Anyway, if you know which files are you going to edit, you may also use jinja templates or reading info from a text file and make a loop or something in the DAG. I could help you more if you clarify a little bit your actual flow
I've decided that, as mentioned by #Anand Vidvat, they are making a distinction between Operators and Tasks here. What I think is that they don't want you to write two Operators that inherently need to be paired together and pass data to each other. On the other hand, it's fine to have one task use data from another, you just have to provide filenames etc in the DAG definition.
For example, many of the builtin Operators have constructor parameters for files, like the S3FileTransformOperator. Confusing documentation, but oh well!
I would like to know how to transfer data between tasks without storing them in between.
Attached image one can find the flow of tasks. As of now I am storing the output csv files of each task as a file in my local machine and fetching this csv file again as an input to next task. I wanted to know if there is any otherway to pass data between tasks without storing it after each task. I researched a bit and came across Xcoms. I wanted to make sure if Xcoms are the right way to achieve this or am I wrong. I could not find any practical examples. Any help is appreciated as I am just a newbie in airflow started couple of days
Short answer is no, tasks require data to be at rest before moving to the nest task. Xcom's are most suited to short strings that can be shared between tasks (file directories, object names, etc.). Your current flow of storing the data in csv files between tasks is the optimal way of running your flow.
XCom is intended for sharing little pieces of information, like the len of the sql table, any specific values or things like that. It is not made for sharing dataframes (which can be huge) because the shared information is written in the metadata database.
So either you keep exporting the csv to your computer (or uploading them somewhere), for reading it in the next Operator, or you combine the operators into one.
I have an mrjob that consists of 3 steps.
The second step expects as input the results of the first step plus some more content from S3.
I understand that I can always "stream" it through the first step, meaning emit is as is, and only use it in the second step, but I would like to avoid this.
Is there a way to define additional input to later steps in mrjob?
Instead of grouping the steps into a single job, you might consider using a persistent job flow to separate your task into the parts before and after the secondary input:
Re-use Amazon Elastic MapReduce instance
http://pythonhosted.org/mrjob/guides/emr-advanced.html
I have a use case where multiple jobs can run at the same time. The output of all the jobs will have to merged with a common master file in HDFS(containing key value pairs) that has no duplicates. I'm not sure how to avoid the race condition that could crop up in this case. As an example both Job 1 and Job 2 simultaneously write the same value to the master file resulting in duplicates. Appreciate your help on this.
Apache Hadoop doesn't support parallel writing to the same file. Here is the reference.
Files in HDFS are write-once and have strictly one writer at any time.
So, multiple maps/jobs can't write to the same file simultaneously. Another job/shell or any other program has to be written to merge the output of multiple jobs.