Airflow policy: How to define and configure one? - airflow

I'm trying to define an Airflow policy, but I'm lost and there isn't much documentation. I created a file inside $AIRFLOW_HOME/config called airflow_settings.py. Inside it I have a policy that tries to change the queue of a failed task, but the policy doesn't work. Does anybody have an example?
By the way, I'm using Airflow 1.10.
Thanks

The file should be named airflow_local_settings.py, and it must be importable, i.e. its directory must be on your PYTHONPATH.
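As a rough illustration, such a file might look like the sketch below. It assumes the Airflow 1.10 cluster-policy hook named policy, which receives each task when DAGs are loaded. The queue name and the retries-based rule are hypothetical and only there to show the shape of the function. Because the hook runs at DAG load time, it does not react to a task failing at runtime.

```python
# airflow_local_settings.py (its directory must be on PYTHONPATH)

# Hypothetical queue name, used for illustration only.
FALLBACK_QUEUE = "high_memory"


def policy(task):
    """Cluster policy hook: called for each task when DAGs are loaded."""
    # Illustrative rule (an assumption, not from the question):
    # route every task that allows retries to a dedicated queue.
    if getattr(task, "retries", 0) > 0:
        task.queue = FALLBACK_QUEUE
```

The function only mutates attributes on the object it is given, so it can be exercised without a running scheduler.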

Related

Unable to access Airflow REST API

I have set up Airflow on my local machine. I am trying to access the following Airflow link:
http://localhost:8080/api/experimental/test/
I am getting the "Airflow 404 = lots of circles" page.
I have tried setting auth_backend to default, but no luck.
What changes do I need to make in airflow.cfg to be able to make REST API calls to Airflow for triggering DAGs?
The experimental API was used in 1.10, but it has been deprecated and is disabled by default in Airflow 2. Instead, you should use the fully-fledged stable REST API, which uses a completely different URL scheme:
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
In Airflow UI you can even browse and try the API (just look at the menus of Airflow).
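For triggering a DAG, a call to the stable API could be built roughly as follows. This is a sketch, assuming the basic-auth backend (airflow.api.auth.backend.basic_auth) is enabled in the [api] section of airflow.cfg. The helper name build_trigger_request and the credentials are hypothetical.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, username, password, conf=None):
    """Build (but do not send) a POST request that triggers a DAG run."""
    # Stable REST API endpoint for creating a DAG run.
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    # HTTP basic auth header; assumes the basic-auth backend is enabled.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send the request to the webserver, for example one running at http://localhost:8080.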

Airflow Cluster Policy not taking effect

I'm attempting to use a Cluster Policy in Airflow 1.9. I followed the instructions in the official documentation, but it doesn't seem to be taking effect.
In my file at $AIRFLOW_HOME/config/airflow_local_settings.py, I've defined the method as the docs instructed and it has the following signature:
def policy(task_instance):
Additional concerns:
What Airflow component is actually running the policy code (is it the scheduler)?
Is there a recommended way to unit test cluster policy code? If not, what about local testing?
Can anyone help me understand why this Cluster Policy isn't taking effect?
I'm using Airflow 1.9.
So you seem to have the file in the right place according to the documents: https://github.com/apache/airflow/blob/master/docs/concepts.rst#where-to-put-airflow_local_settingspy
And your signature is right: https://airflow.apache.org/docs/stable/concepts.html#mutate-tasks-after-dag-loaded
But you haven't shown what you did and how that "did not work".
I believe the def policy(task): hook is run on the scheduler after DAG parsing (as the docs seem to say), while the def task_instance_mutation_hook(ti): hook is run by the task executor on the worker. That's probably why you're not seeing some of your changes take effect.
For example, a timeout or a queue is something the scheduler enforces, but a connection ID is something the worker needs to know during execution.
So if the policy you wanted to apply was a timeout, it should have worked, but if it was a connection ID enforcement, it would not have.
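On the local-testing question: since a policy is a plain function that mutates the object it receives, one option is to call it directly with a stand-in object. The sketch below assumes a timeout policy. The 48-hour cap and the test name are hypothetical, and SimpleNamespace merely imitates the one operator attribute the policy touches.

```python
from datetime import timedelta
from types import SimpleNamespace

MAX_TIMEOUT = timedelta(hours=48)  # hypothetical cap, for illustration


def policy(task):
    """Cap execution_timeout on every task (would live in airflow_local_settings.py)."""
    timeout = getattr(task, "execution_timeout", None)
    if timeout is None or timeout > MAX_TIMEOUT:
        task.execution_timeout = MAX_TIMEOUT


def test_policy_caps_timeout():
    # Stand-in for an operator; only the attribute the policy reads is needed.
    fake_task = SimpleNamespace(execution_timeout=timedelta(hours=72))
    policy(fake_task)
    assert fake_task.execution_timeout == MAX_TIMEOUT
```

A test like this can run under any test runner without Airflow installed. It checks the policy logic only, not whether Airflow actually picks up the settings file.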

Oozie launch script after Coordinator Start

I'm looking for a way to launch a custom script when a coordinator starts.
So when a coordinator starts running a job, I need to make, for example, an API call to a third-party service.
Is there a way or a workaround to make this possible?
Thank you
Solution found: the key is the property oozie.wf.workflow.notification.url.
Add the following parameter to the workflow configuration file:
<property>
<name>oozie.wf.workflow.notification.url</name>
<value>http://server_name:8080/oozieNotification/jobUpdate?jobId=$jobId%26status=$status</value>
</property>
Then create a web service listening on this URL.
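A listener for that URL could be sketched as below, assuming Oozie calls the URL with an HTTP GET and substitutes $jobId and $status into the query string. The names parse_notification, NotificationHandler and serve are hypothetical, and the print call is a placeholder for the third-party API call.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, unquote, urlparse


def parse_notification(path):
    """Extract (job_id, status) from the request path of a notification call."""
    # Unquote first so a URL-encoded separator (%26) is also treated as '&'.
    params = parse_qs(unquote(urlparse(path).query))
    job_id = params.get("jobId", [None])[0]
    status = params.get("status", [None])[0]
    return job_id, status


class NotificationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        job_id, status = parse_notification(self.path)
        # Placeholder: call the third-party service here instead of printing.
        print(f"workflow {job_id} is now {status}")
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Run the listener until interrupted."""
    HTTPServer(("", port), NotificationHandler).serve_forever()
```

Calling serve() starts the listener on port 8080, matching the port in the property value above.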

What's the best way to log in Oozie

We are using Oozie workflows with an Oozie main class in the action. I am not really sure what the best logging strategy is. Should we just use log4j, since that seems to be the default strategy? Do those logs get collected on the data nodes?
"Should we just use log4j since it seems like that is the default strategy?"
I have not found any mention of someone using an alternative logger. It seems to be discouraged:
"While Oozie can technically use any valid log4j Appender or configurations that violate the above restrictions, certain features related to logs may be disabled and/or not work correctly, and is thus not advised."
Your other question:
"Do those logs get collected on the data nodes?"
An SO answer mentions that
"the logs are distributed across your cluster, but by logging them to the rootLogger, you should be able to see them via the job tracker (by drilling down on the Job task attempts)."
You can inspect them with the Oozie command-line client. For example, this prints the last 10 lines of a job's log:
$ oozie job -oozie oozie_URL -log job_ID | tail -n 10

Process scheduler runtime parameter

Can anyone recommend a fairly clean method for determining, at run time, which process scheduler an App Engine program is running on (NT or UNIX)? I need to set a file path that depends on the server the process is being executed on. I understand the GetEnv command can be used, but I don't want to set an environment variable for this particular instance (the file does not reside under the PS_FILES path). I've searched PeopleBooks for any kind of built-in function or system variable, but without success.
Any suggestions would be appreciated.
Thanks
Okay, I may have asked this question a little too early. I apologize.
It looks like I'll just be able to query the process request table to pull back the server name:
SQLExec("SELECT SERVERNAMERUN FROM PSPRCSRQST WHERE PRCSINSTANCE = :1", &thisProcess, &server);
Evaluate &server
When.......
End-Evaluate;
Exactly :-)
There are a host of Process Request records that can give you the information you need. Glad that you found it.
John
