airflow run dag with arguments on remote webserver - airflow

I would like to kick off dags on a remote webserver. These dags require arguments in order to make sense. Locally, I use a command like this:
airflow trigger_dag dag_id --conf '{"parameter":"~/path" }'
The problem is that this assumes I'm running locally. How can I trigger a dag on a remote airflow server with arguments? I realize I could use the ui to hit the play button, but that doesn't allow you to pass arguments that I am aware of.

Example url:
http://localhost:8080/api/experimental/dags/<dag_id>/dag_runs
Example post payload(application/json):
{"conf":"{\"client\":\"popsicle\"}"}
Note that the embedded conf object must be a string, not an object.

Related

How to pass a conf to a scheduled DAG

When the DAG is triggered manually there are multiple ways to pass the config. It could be done from the UI, via the airflow CLI using --conf argument & using the REST API.
But when a DAG is scheduled using a cron expression, the DAG always fails because the tasks in the DAG are expecting the values from conf.
Is there a DAG level configuration which can be used to set "default" values for conf values (WITHOUT doing a null check in the Python code itself and hardcoding a default value)
The reason I do not want to do this null check in the code itself is because I want the conf keys & default values to be exposed via an Airflow API if possible

How do I check if there are DAGs running in Airflow (before restarting Airflow)?

I need to restart Airflow. I want to make sure I do it when it's idle, so I that I don't interrupt a job by restarting the worker component of Airflow.
How do I see what DAGs are running?
I don't see anything in the UI that would list currently running DAGs.
I don't see any command in the airflow CLI to list currently running DAGs.
I found airflow shell that lets me connect to the DB, but I don't know enough about Airflow internals to know where to look to see what's running.
You can also query the database to get all running tasks at once:
select * from task_instance where state='running'
You can use the command line command airflow jobs check which would return "No alive jobs found." in the event no jobs are running.
I found it... it's in the UI, on the DAGs page, it's the second circle under "Recent Tasks":

How to prevent "Execution failed:[Errno 32] Broken pipe" in Airflow

I just started using Airflow to coordinate our ETL pipeline.
I encountered the pipe error when I run a dag.
I've seen a general stackoverflow discussion here.
My case is more on the Airflow side. According to the discussion in that post, the possible root cause is:
The broken pipe error usually occurs if your request is blocked or
takes too long and after request-side timeout, it'll close the
connection and then, when the respond-side (server) tries to write to
the socket, it will throw a pipe broken error.
This might be the real cause in my case, I have a pythonoperator that will start another job outside of Airflow, and that job could be very lengthy (i.e. 10+ hours), I wonder if what is the mechanism in place in Airflow that I can leverage to prevent this error.
Can anyone help?
UPDATE1 20190303-1:
Thanks to #y2k-shubham for the SSHOperator, I am able to use it to set up a SSH connection successfully and am able to run some simple commands on the remote site (indeed the default ssh connection has to be set to localhost because the job is on the localhost) and am able to see the correct result of hostname, pwd.
However, when I attempted to run the actual job, I received same error, again, the error is from the jpipeline ob instead of the Airflow dag/task.
UPDATE2: 20190303-2
I had a successful run (airflow test) with no error, and then followed another failed run (scheduler) with same error from pipeline.
While I'd suggest you keep looking for a more graceful way of trying to achieve what you want, I'm putting up example usage as requested
First you've got to create an SSHHook. This can be done in two ways
The conventional way where you supply all requisite settings like host, user, password (if needed) etc from the client code where you are instantiating the hook. Im hereby citing an example from test_ssh_hook.py, but you must thoroughly go through SSHHook as well as its tests to understand all possible usages
ssh_hook = SSHHook(remote_host="remote_host",
port="port",
username="username",
timeout=10,
key_file="fake.file")
The Airflow way where you put all connection details inside a Connection object that can be managed from UI and only pass it's conn_id to instantiate your hook
ssh_hook = SSHHook(ssh_conn_id="my_ssh_conn_id")
Of course, if your'e relying on SSHOperator, then you can directly pass the ssh_conn_id to operator.
ssh_operator = SSHOperator(ssh_conn_id="my_ssh_conn_id")
Now if your'e planning to have a dedicated task for running a command over SSH, you can use SSHOperator. Again I'm citing an example from test_ssh_operator.py, but go through the sources for a better picture.
task = SSHOperator(task_id="test",
command="echo -n airflow",
dag=self.dag,
timeout=10,
ssh_conn_id="ssh_default")
But then you might want to run a command over SSH as a part of your bigger task. In that case, you don't want an SSHOperator, you can still use just the SSHHook. The get_conn() method of SSHHook provides you an instance of paramiko SSHClient. With this you can run a command using exec_command() call
my_command = "echo airflow"
stdin, stdout, stderr = ssh_client.exec_command(
command=my_command,
get_pty=my_command.startswith("sudo"),
timeout=10)
If you look at SSHOperator's execute() method, it is a rather complicated (but robust) piece of code trying to achieve a very simple thing. For my own usage, I had created some snippets that you might want to look at
For using SSHHook independently of SSHOperator, have a look at ssh_utils.py
For an operator that runs multiple commands over SSH (you can achieve the same thing by using bash's && operator), see MultiCmdSSHOperator

Determining if a DAG is executing

I am using Airflow 1.9.0 with a custom SFTPOperator. I have code in my DAGs that poll an SFTP site to find new files. If any are found, then I create custom task id's for the dynamically created task and retrieve/delete the files.
directory_list = sftp_handler('sftp-site', None, '/', None, SFTPToS3Operation.LIST)
for file_path in directory_list:
... SFTP code that GET's the remote files
That part works fine. It seems both the airflow webserver and airflow scheduler are iterating through all the DAGs once a second and actually running the code that retrieves the directory_list. This means I'm hitting the SFTP site ~2 x a second to authenticate and pull a list of files. I'd like to have some conditional code that only executes if the DAG is actually being run.
When an SFTP site uses password authentication, the # of times I connect really isn't an issue. One site requires key authentication and if there are too many authentication failures in a short timespan, the account is locked. During my testing, this seems to happen occasionally for reasons I'm still trying to track down.
However, if I were authenticating only when the DAG was scheduled to execute, or executing manually, this would not be an issue. It also seems wasteful to spend so much time connecting to an SFTP site when it's not scheduled to do so.
I've seen a post that can check to see if a task is executing, but that's not ideal as I'd have to create a long-running task, using up resources I shouldn't require, just to perform that test. Any thoughts on how to accomplish this?
You have a very good use case for Airflow (SFTP to _____ batch jobs), but Airflow is not meant for dynamic DAGs as you are attempting to use them.
Top-Level DAG Code and the Scheduler Loop
As you noticed, any top-level code in a DAG is executed with each scheduler loop. Or put another way, every time the scheduler loop processes the files in your DAG directory it is interpreting all the code in your DAG files. Anything not in a task or operator is interpreted/executed immediately. This puts undue strain on the scheduler as well as any external systems you are making calls to.
Dynamic DAGs and the Airflow UI
Airflow does not handle dynamic DAGs through the UI well. This is mostly the result of the Airflow DAG state not being stored in the database. DAG views and history are rendered based on what exist in the interpreted DAG file at any given moment. I personally hope to see this change in the future with some form of DAG versioning.
In a dynamic DAG you can both add and remove tasks from a DAG.
Adding Tasks Dynamically
When adding tasks for a DAG run will make it appear (in the UI) that all DAG
runs before when that task never ran that task all. The will have a None state
and the DAG run will be set to success or failed depending on the outcome
of the DAG run.
Removing Tasks Dynamically
If your dynamic DAG ever removes tasks you will lose the ability to review history of the DAG. For example, if you run a DAG with task_x in the first 20 DAG runs but remove it after that, it will fail to show up in the UI until it is added back into the DAG.
Idempotency and Airflow
Airflow works best when the DAG runs are idempotent. This means that re-running any DAG Run should have the same affect no matter when you run it or how many times you run it. Dynamic DAGs in Airflow break idempotency by adding and removing tasks to previous DAG runs so that the results of re-running are not the same.
Solution Options
You have at least two options moving forward
1.) Continue to build your SFTP DAG dynamically, but create another DAG that writes the available SFTP files to a local file (if not using distributed executor) or an Airflow Variable (this will result in more reads to the Airflow DB) and build your DAG dynamically from that.
2.) Overload the SFTPOperator to take a list of files so that every file that exist is processed within a single task run. This will make the DAGs idempotent and you will maintain accurate history through the logs.
I apologize for the extended explanation, but you're touching on one of the rough spots of Airflow and I felt it was appropriate to give an overview of the problem at hand.

Passing parameters to Airflow's jobs through UI

Is it possible to pass parameters to Airflow's jobs through UI?
AFAIK, 'params' argument in DAG is defined in python code, therefore it can't be changed at runtime.
Depending on what you're trying to do, you might be able to leverage Airflow Variables. These can be defined or edited in the UI under the Admin tab. Then your DAG code can read the value of the variable and pass the value to the DAG(s) it creates.
Note, however, that although Variables let you decouple values from code, all runs of a DAG will read the same value for the variable. If you want runs to be passed different values, your best bet is probably to use airflow templating macros and differentiate macros with the run_id macro or similar
Two ways to change your DAG behavior:
Use Airflow variables like mentioned by Bryan in his answer.
Use Airflow JSON Conf to pass JSON data to a single DAG run. JSON can be passed either from
UI - manual trigger from tree view
UI - create new DAG run from browse > DAG runs > create new record
or from
CLI
airflow trigger_dag 'MY_DAG' -r 'test-run-1' --conf '{"exec_date":"2021-09-14"}'
Within the DAG this JSON can be accessed using jinja templates or in the operator callable function context param.
def do_some_task(**context):
print(context['dag_run'].conf['exec_date'])
task1 = PythonOperator(
task_id='task1_id',
provide_context=True,
python_callable=do_some_task,
dag=dag,
)
#access in templates
task2 = BashOperator(
task_id="task2_id",
bash_command="{{ dag_run.conf['exec_date'] }}",
dag=dag,
)
Note that the JSON conf will not be present during scheduled runs. The best use case for JSON conf is to override the default DAG behavior. Hence set meaningful defaults in the DAG code so that during scheduled runs JSON conf is not used.

Resources