Airflow Map-Reduce Operator - Oozie

I am looking for an Airflow MapReduce operator. I currently use the Oozie MapReduce action and am trying to convert to Airflow. I have looked under
https://airflow.apache.org/docs/apache-airflow/1.10.14/_api/airflow/contrib/operators/index.html and
https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/index.html
but could not find any MapReduce operator. Please advise.

I do not think there is a generic "Hadoop MapReduce" operator.
If you happen to be running your Hadoop cluster on Dataproc, there is an operator to submit a job there:
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index.html#airflow.providers.google.cloud.operators.dataproc.DataprocSubmitHadoopJobOperator
But if you need to run it on your on-premise Hadoop cluster, then you likely need to use a BashOperator that calls the Hadoop command line tools, as in the sketch below.
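
As a rough illustration, here is a minimal sketch of wrapping a MapReduce job in a BashOperator. The import paths assume Airflow 2.x; the jar path, main class, and HDFS paths are placeholders, not anything from the original question, and the Hadoop client is assumed to be installed and configured on the Airflow worker.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="hadoop_mr_via_bash",        # placeholder DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        run_mr_job = BashOperator(
            task_id="run_mr_job",
            # Placeholder jar, main class, and HDFS paths; the worker needs
            # the "hadoop" CLI on its PATH and a valid cluster configuration.
            bash_command=(
                "hadoop jar /path/to/my-mr-job.jar com.example.MyMapReduceJob "
                "hdfs:///input/path hdfs:///output/path"
            ),
        )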

Related

How to parameterize DataprocSubmitJobOperator in airflow

In Airflow, while triggering Dataproc Spark jobs, we can pass the parameters required by the Spark jobs.
However, I am wondering if we can pass parameters based on schedules.
E.g.
If the date is between the 1st and the 7th, execute the Dataproc job with parameter X.
If the date is between the 8th and the end of the month, execute the Dataproc job with parameter Y.
This would help avoid redundant DAGs.
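
One possible approach, as a minimal sketch under the assumption that the choice depends only on the day of the month: compute the parameter with a small Python callable and pull it into the Dataproc operator via XCom templating. The DAG id, task names, and parameter values below are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def choose_parameter(ds, **kwargs):
        # 'ds' is the run's logical date as a "YYYY-MM-DD" string.
        day = int(ds.split("-")[2])
        # Days 1-7 use parameter set X; day 8 to end of month uses Y.
        return "X" if day <= 7 else "Y"

    with DAG(
        dag_id="dataproc_param_by_schedule",   # placeholder DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        pick_param = PythonOperator(
            task_id="pick_param",
            python_callable=choose_parameter,
        )
        # The returned value lands in XCom and can be templated into a
        # downstream DataprocSubmitJobOperator, e.g.:
        #   arguments=["{{ ti.xcom_pull(task_ids='pick_param') }}"]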

Airflow - disable dag after X consecutive fails

I read the API reference and couldn't find anything on it. Is that possible?
Currently there is no feature that does this out of the box, but you can write some custom code in your DAG to work around it. For example, use a PythonOperator (or the MySQL operator, if your metadata database is MySQL) to get the status of the last X runs of the DAG,
then use a BranchPythonOperator to check whether the number of failures reaches X; if it does, use a BashOperator to run the airflow pause CLI command.
You can also collapse this into a two-step task by moving the PythonOperator logic into the BranchPythonOperator. This is just one idea; you can use different logic, for example along the lines of the sketch below.
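
As a rough sketch of that idea (swapping the CLI call for a direct update of the metadata database, which is not part of the original answer), the callable below checks the last X runs and pauses the DAG if they all failed. The DAG id and failure threshold are placeholders.

    from airflow import settings
    from airflow.models import DagModel, DagRun
    from airflow.utils.state import State

    def pause_dag_if_failing(dag_id="my_dag", max_consecutive_failures=3):
        """Pause dag_id if its last N runs all failed (N = max_consecutive_failures)."""
        session = settings.Session()
        last_runs = (
            session.query(DagRun)
            .filter(DagRun.dag_id == dag_id)
            .order_by(DagRun.execution_date.desc())
            .limit(max_consecutive_failures)
            .all()
        )
        if len(last_runs) == max_consecutive_failures and all(
            run.state == State.FAILED for run in last_runs
        ):
            # Same effect as "airflow dags pause <dag_id>" from the CLI, done by
            # flipping the is_paused flag in the metadata database.
            session.query(DagModel).filter(DagModel.dag_id == dag_id).update(
                {"is_paused": True}
            )
        session.commit()
        session.close()

This could be wrapped in a PythonOperator placed at the end of the DAG (with trigger_rule="all_done") so it runs regardless of upstream failures.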

How to prevent "Execution failed:[Errno 32] Broken pipe" in Airflow

I just started using Airflow to coordinate our ETL pipeline.
I encountered the pipe error when I ran a DAG.
I've seen a general Stack Overflow discussion here.
My case is more on the Airflow side. According to the discussion in that post, the possible root cause is:
The broken pipe error usually occurs if your request is blocked or
takes too long and after request-side timeout, it'll close the
connection and then, when the respond-side (server) tries to write to
the socket, it will throw a pipe broken error.
This might be the real cause in my case: I have a PythonOperator that starts another job outside of Airflow, and that job can be very lengthy (i.e. 10+ hours). I wonder what mechanism Airflow has in place that I could leverage to prevent this error.
Can anyone help?
UPDATE1 20190303-1:
Thanks to #y2k-shubham for the SSHOperator. I was able to use it to set up an SSH connection successfully and to run some simple commands on the remote side (indeed, the default SSH connection has to be set to localhost because the job is on the localhost), and I can see the correct results of hostname and pwd.
However, when I attempted to run the actual job, I received the same error; again, the error comes from the pipeline job rather than from the Airflow dag/task.
UPDATE2 20190303-2:
I had a successful run (airflow test) with no error, followed by another failed run (scheduler) with the same error from the pipeline.
While I'd suggest you keep looking for a more graceful way of achieving what you want, I'm putting up example usage as requested.
First you've got to create an SSHHook. This can be done in two ways:
The conventional way, where you supply all requisite settings like host, user, password (if needed), etc. from the client code where you are instantiating the hook. I'm citing an example from test_ssh_hook.py here, but you should thoroughly go through SSHHook as well as its tests to understand all possible usages.
ssh_hook = SSHHook(remote_host="remote_host",
                   port="port",
                   username="username",
                   timeout=10,
                   key_file="fake.file")
The Airflow way, where you put all connection details inside a Connection object that can be managed from the UI, and only pass its conn_id to instantiate your hook:
ssh_hook = SSHHook(ssh_conn_id="my_ssh_conn_id")
Of course, if you're relying on SSHOperator, then you can directly pass the ssh_conn_id to the operator.
ssh_operator = SSHOperator(ssh_conn_id="my_ssh_conn_id")
Now if you're planning to have a dedicated task for running a command over SSH, you can use SSHOperator. Again I'm citing an example from test_ssh_operator.py, but go through the sources for a better picture.
task = SSHOperator(task_id="test",
                   command="echo -n airflow",
                   dag=self.dag,
                   timeout=10,
                   ssh_conn_id="ssh_default")
But then you might want to run a command over SSH as part of a bigger task. In that case, you don't need an SSHOperator; you can still use just the SSHHook. The get_conn() method of SSHHook gives you an instance of a paramiko SSHClient, with which you can run a command using an exec_command() call:
my_command = "echo airflow"
stdin, stdout, stderr = ssh_client.exec_command(
command=my_command,
get_pty=my_command.startswith("sudo"),
timeout=10)
If you look at SSHOperator's execute() method, it is a rather complicated (but robust) piece of code trying to achieve a very simple thing. For my own usage, I created some snippets that you might want to look at:
For using SSHHook independently of SSHOperator, have a look at ssh_utils.py
For an operator that runs multiple commands over SSH (you can achieve the same thing by using bash's && operator), see MultiCmdSSHOperator

airflow run dag with arguments on remote webserver

I would like to kick off DAGs on a remote webserver. These DAGs require arguments in order to make sense. Locally, I use a command like this:
airflow trigger_dag dag_id --conf '{"parameter":"~/path" }'
The problem is that this assumes I'm running locally. How can I trigger a DAG on a remote Airflow server with arguments? I realize I could use the UI to hit the play button, but that doesn't allow you to pass arguments, as far as I am aware.
Example URL:
http://localhost:8080/api/experimental/dags/<dag_id>/dag_runs
Example POST payload (application/json):
{"conf":"{\"client\":\"popsicle\"}"}
Note that the embedded conf object must be a string, not an object.
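
For illustration, here is a minimal sketch of calling that endpoint with the requests library. It assumes the experimental API is enabled and reachable without extra authentication on the remote webserver; the host name and dag_id are placeholders.

    import json

    import requests

    dag_id = "my_dag"  # placeholder
    url = f"http://remote-airflow-host:8080/api/experimental/dags/{dag_id}/dag_runs"

    # As noted above, the "conf" value must be a JSON-encoded string, not a nested object.
    payload = {"conf": json.dumps({"parameter": "~/path"})}

    response = requests.post(url, json=payload)
    response.raise_for_status()
    print(response.json())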

Use Airflow for batch processing to dynamically start multiple tasks based on the output of a parent task

I am trying to figure out whether Airflow can be used to express a workflow where multiple instances of the same task need to be started based on the output of a parent task. Airflow supports multiple workers, so I naively expect that it can be used to orchestrate workflows involving batch processing. So far I have failed to find any recipe/direction that fits this model. What is the right way to leverage Airflow for a batch processing workflow like the one below? Assume there is a pool of Airflow workers.
Example of a workflow:
1. Start Task A to produce multiple files
2. For each file start an instance of Task B (might be another workflow)
3. Wait for all instances of Task B, then start Task C
As a hack to parallelize processing of input data in Airflow, I use a custom operator that splits the input into a predetermined number of partitions. The downstream operator gets replicated for each partition, and if needed the results can be merged again. For local files, the operator runs the split command. On Kubernetes, this works nicely with cluster autoscaling. A sketch of this fan-out/fan-in pattern follows.
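
For illustration, a minimal sketch of that fixed-partition fan-out/fan-in pattern using plain PythonOperators (the custom split operator itself is not shown); the DAG id, task names, callables, and partition count are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    NUM_PARTITIONS = 4  # predetermined number of partitions

    def split_input(**kwargs):
        # Task A: split the input into NUM_PARTITIONS files/partitions.
        pass

    def process_partition(partition_index, **kwargs):
        # Task B: process a single partition.
        pass

    def merge_results(**kwargs):
        # Task C: merge the outputs of all Task B instances.
        pass

    with DAG(
        dag_id="batch_fan_out_fan_in",      # placeholder DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        task_a = PythonOperator(task_id="split_input", python_callable=split_input)
        task_c = PythonOperator(task_id="merge_results", python_callable=merge_results)

        # One Task B instance per partition; Task C waits for all of them.
        for i in range(NUM_PARTITIONS):
            task_b = PythonOperator(
                task_id=f"process_partition_{i}",
                python_callable=process_partition,
                op_kwargs={"partition_index": i},
            )
            task_a >> task_b >> task_c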
