What's the best way to log in Oozie?

We are using Oozie workflows with an Oozie main class in the action. I am not really sure what the best logging strategy is. Should we just use log4j, since it seems like that is the default strategy? Do those logs get collected on the data nodes?

"Should we just use log4j since it seems like that is the default strategy?"
I have not found any mention of anyone using an alternative logger. It seems to be discouraged:
"While Oozie can technically use any valid log4j Appender or configurations that violate the above restrictions, certain features related to logs may be disabled and/or not work correctly, and is thus not advised."
Your other question:
"Do those logs get collected on the data nodes?"
An SO answer mentions that
"the logs are distributed across your cluster, but by logging them to the rootLogger, you should be able to see them via the job tracker (by drilling down on the Job task attempts)."
You can also inspect them with the Oozie command-line client. For example, this prints the last 10 lines of a job's log:
$ oozie job -oozie oozie_URL -log job_ID | tail -n 10

Related

Custom Operator States (queued, success, etc.) in Apache Airflow?

In Apache Airflow (2.x), each Operator Instance has a state as defined here (airflow source repo).
I have two use cases that don't seem to clearly fall into the pre-defined states:
1. Warn, but don't fail - This seems like it should be a very standard use case and I am surprised to not see it in the out-of-the-box airflow source code. Basically, I'd like to color-code a node with something eye-catching - say orange - corresponding to a non-fatal warning, but continue execution as normal otherwise. Obviously you can print warnings to the log, but finding them takes more work than just looking at the colorful circles on the DAGs page.
2. "Sensor N/A" or "Data not ready" - This would be a status that gets assigned when a sensor notices that data in the source system is not yet ready, and that downstream operators can be skipped until the next execution of the DAG, but that nothing in the data pipeline is broken. Basically an expected end-of-branch.
Is there a good way of achieving either of these use cases with the out-of-the-box Airflow node states? If not, is there a way to define custom operator states? Since I am running Airflow on a managed service (MWAA), I don't think changing the source code of our deployment is an option.
Thanks,
The task states are tightly integrated with Airflow. There's no way to configure which logging levels lead to which state. I'd say the easiest way is to grep log files for "WARNING" or set up a log aggregation service e.g. Elasticsearch to make log files searchable.
For #2, sensors have no knowledge about why a sensor timed out. After timeout or execution_timeout is reached, they simply raise an Exception. You can deal with exceptions using trigger_rules, but these still don't take the reason for an exception into account.
If you want more control over this, I would implement your own Sensor which takes an argument e.g. data_not_ready_timeout (which is smaller than timeout and execution_timeout). In the poke() method, check if data_not_ready_timeout has been reached, and raise an AirflowSkipException if so. This will skip downstream tasks. Once timeout or execution_timeout are reached, the task is failed. Look at BaseSensorOperator.execute() for some inspiration to get the initial starting date of a sensor.
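The decision logic described above can be sketched in plain Python, without importing Airflow. This is only an illustration: the function name sensor_outcome and the returned labels are hypothetical. In a real custom sensor, the "skip" branch would be where poke() raises AirflowSkipException, and the "fail" branch corresponds to the sensor's normal timeout behaviour.

```python
from datetime import datetime, timedelta


def sensor_outcome(started_at, now, data_ready,
                   data_not_ready_timeout, timeout):
    """Decide what a custom sensor should do on one poke.

    Hypothetical helper, not part of Airflow. Returns one of:
      "success" - data is ready, downstream tasks may run
      "fail"    - the hard timeout was reached
      "skip"    - only the softer data_not_ready_timeout was reached;
                  a real sensor would raise AirflowSkipException here
      "wait"    - neither limit reached yet, keep poking
    """
    if data_not_ready_timeout >= timeout:
        raise ValueError("data_not_ready_timeout must be smaller than timeout")
    if data_ready:
        return "success"
    elapsed = now - started_at
    if elapsed >= timeout:
        return "fail"
    if elapsed >= data_not_ready_timeout:
        return "skip"
    return "wait"


# Example: started at midnight, data still missing 15 minutes later,
# soft limit of 10 minutes and hard limit of 1 hour.
start = datetime(2024, 1, 1)
outcome = sensor_outcome(start, start + timedelta(minutes=15), False,
                         timedelta(minutes=10), timedelta(hours=1))
```

Keeping the decision in a small pure function like this also makes it easy to test separately from the sensor class that calls it.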

Airflow Cluster Policy not taking effect

I'm attempting to use a Cluster Policy in Airflow 1.9. I followed the instructions in the official documentation, but it doesn't seem to be taking effect.
In my file at $AIRFLOW_HOME/config/airflow_local_settings.py, I've defined the method as the docs instructed and it has the following signature:
def policy(task_instance):
Additional concerns:
What Airflow component is actually running the policy code (is it the scheduler)?
Is there a recommended way to unit test cluster policy code? If not, what about local testing?
Can anyone help me understand why this Cluster Policy isn't taking effect?
I'm using Airflow 1.9.
So you seem to have the file in the right place according to the documents: https://github.com/apache/airflow/blob/master/docs/concepts.rst#where-to-put-airflow_local_settingspy
And your signature is right: https://airflow.apache.org/docs/stable/concepts.html#mutate-tasks-after-dag-loaded
But you haven't shown what you did and how that "did not work".
I believe the def policy(task): hook is run on the scheduler after DAG parsing (as the docs seem to say), while the def task_instance_mutation_hook(ti): hook is run by the task executor on the worker. That's probably why you're not seeing some changes take effect.
For example, a timeout or queue is something the scheduler enforces, but a connection ID is something the worker needs to know during execution.
So if what you wanted to enforce was a timeout policy, it should have worked, but if what you wanted to enforce was a connection ID, it wouldn't have.
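On the question of local testing, one possible approach is to treat the policy as an ordinary function and call it on a stand-in object. The sketch below assumes the task object exposes execution_timeout and queue attributes, as Airflow operators do. The 48-hour limit, the queue name "sensor_queue", and the use of SimpleNamespace as a fake task are assumptions made for illustration.

```python
from datetime import timedelta
from types import SimpleNamespace

MAX_EXECUTION_TIMEOUT = timedelta(hours=48)  # assumed limit, for illustration


def policy(task):
    """Cluster policy sketch: cap execution_timeout and reroute sensors."""
    if (task.execution_timeout is None
            or task.execution_timeout > MAX_EXECUTION_TIMEOUT):
        task.execution_timeout = MAX_EXECUTION_TIMEOUT
    # Route anything whose class name ends in "Sensor" to a dedicated queue.
    if task.__class__.__name__.endswith("Sensor"):
        task.queue = "sensor_queue"  # hypothetical queue name


# Local check with a stand-in object instead of a real operator.
fake_task = SimpleNamespace(execution_timeout=timedelta(hours=72),
                            queue="default")
policy(fake_task)
```

After the call, fake_task.execution_timeout has been capped at 48 hours and its queue is unchanged, because the stand-in's class name does not end in "Sensor". This only exercises the function itself; it does not show which Airflow component invokes it.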

Which is the best scheduler for Hadoop: Oozie or cron?

Can anyone please suggest which scheduler is best suited for Hadoop?
If it is Oozie, how is Oozie different from cron jobs?
Oozie is the best option.
Oozie Coordinator allows triggering actions when files arrive at HDFS. This will be challenging to implement anywhere else.
Oozie gets callbacks from MapReduce jobs so it knows when they finish and whether they hang without expensive polling. No other workflow manager can do this.
There are some benefits over crontab or any other scheduler; see this link:
https://prodlife.wordpress.com/2013/12/09/why-oozie/
Oozie can start jobs on data availability. This is not free, since someone has to signal when the data is available.
Oozie allows you to build complex workflows using the mouse.
Oozie allows you to schedule workflow execution using the coordinator.
Oozie allows you to bundle one or more coordinators.
Using cron on Hadoop is a bad idea, but it is still fast, reliable, and well known. Most of the work that comes for free with Oozie has to be coded yourself if you use cron.
Using Oozie without Java means (at the current date) facing a long list of dependency problems.
If you are a Java programmer, Oozie is a must.
Cron is still a good choice when you are in the test/verify stage.
Oozie separates specifications for workflow and schedule into a workflow specification and a coordinator specification, respectively. Coordinator specifications are optional, only required if you want to run a job repeatedly on a schedule. By convention you usually see workflow specifications in a file called workflow.xml and a coordinator specification in a file called coordinator.xml. The new cron-like scheduling affects these coordinator specifications. Let’s take a look at a coordinator specification that will cause a workflow to be run every weekday at 2 AM.
[xml]
<coordinator-app name="weekdays-at-two-am"
                 frequency="0 2 * * 2-6"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
      <configuration>
        <property>
          <name>jobTracker</name>
          <value>${jobTracker}</value>
        </property>
        <property>
          <name>nameNode</name>
          <value>${nameNode}</value>
        </property>
        <property>
          <name>queueName</name>
          <value>${queueName}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
[/xml]
The key thing here is the frequency attribute in the coordinator-app element: a cron-like specification that tells Oozie when to run the workflow. The values for placeholders such as ${start} and ${end} are specified in a separate properties file. The specification is “cron-like”, and you might notice one important difference: days of the week are numbered 1-7 (1 being Sunday) as opposed to the 0-6 numbering used in standard cron.
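To make the numbering difference concrete, here is a small Python sketch that translates a standard cron day-of-week value into the 1-7 numbering described above. The function names are hypothetical helpers, not part of Oozie, and only single numbers and simple "a-b" ranges are handled.

```python
def cron_dow_to_oozie(dow):
    """Map a standard cron day of week (0-6, 0 = Sunday; 7 is also accepted
    as Sunday) to the 1-7 numbering where 1 = Sunday."""
    if not 0 <= dow <= 7:
        raise ValueError(f"day of week out of range: {dow}")
    return (dow % 7) + 1


def cron_dow_range_to_oozie(field):
    """Translate a simple 'a-b' day-of-week range, e.g. '1-5'."""
    start, end = (int(part) for part in field.split("-"))
    return f"{cron_dow_to_oozie(start)}-{cron_dow_to_oozie(end)}"


# Monday to Friday is 1-5 in standard cron.
weekdays = cron_dow_range_to_oozie("1-5")  # "2-6"
```

This matches the coordinator above, where the weekday schedule is written as 2-6 rather than the 1-5 you would use in a crontab.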
For more information, visit: http://hortonworks.com/blog/new-in-hdp-2-more-powerful-scheduling-options-in-oozie/
Apache Oozie is built to work with YARN and HDFS.
Oozie provides many features, such as data dependencies, coordinators, and workflow actions.
Oozie documentation
I think Oozie is the best option.
Sure, you can use cron, but it will take a lot of effort to make it work with Hadoop.

How to reschedule a coordinator job in Oozie without restarting the job?

When I changed the start time of a coordinator job in job.properties in Oozie, the job did not pick up the changed time; instead it keeps running at the old scheduled time.
Old job.properties:
startMinute=08
startTime=${startDate}T${startHour}:${startMinute}Z
New job.properties:
startMinute=07
startTime=${startDate}T${startHour}:${startMinute}Z
The job is not running at the changed time (the 7th minute); it still runs at the 8th minute of every hour.
Could you please let me know how I can make the job pick up the updated properties (the changed timing) without restarting or killing the job?
You can't really change the timing of the coordinator through any method provided by Oozie (v3.3.2). When you submit a job, the contents of the properties file are stored in the database, whereas the actual workflow lives in HDFS.
Every time the coordinator executes, the workflow must be present at the path specified in the properties at submission time, but the properties file itself is no longer needed. In other words, the properties file does not come into the picture after the job is submitted.
One hack is to update the time directly in the database with an SQL query, but I am not sure about the implications; the property might become inconsistent across the database.
You have to kill the job and submit a new one.
Note: Oozie does provide a way to change the concurrency, end time, and pause time, as specified in the official docs.

UNIX - Stopping a custom service

I created a client-server application and now I would like to deploy it.
During development I started the server in a terminal, and when I wanted to stop it I just had to type "Ctrl-C".
Now I want to be able to start it in the background and stop it whenever I want by just typing:
/etc/init.d/my_service {start|stop}
I know how to write an init script, but the problem is how to actually stop the process.
I first thought to retrieve the PID with something like:
ps aux | grep "my_service"
Then I found a better idea, still with the PID: storing it in a file in order to retrieve it when trying to stop the service.
That seemed too dirty and unsafe, so I eventually thought about using sockets to let the "stop" command tell the running process to shut down.
I would like to know how this is usually done, or rather what the best way to do it is.
I checked some of the files in init.d and some of them use PID files, but with a particular command, "start-stop-daemon". I am a bit suspicious about this method, which seems unsafe to me.
If you have a utility like start-stop-daemon available, use it.
start-stop-daemon is flexible and can use 4 different methods to find the process ID of the running service. It uses this information (1) to avoid starting a second copy of the same service when starting, and (2) to determine which process ID to kill when stopping the service.
--pidfile: Check whether a process has created the file pid-file.
--exec: Check for processes that are instances of this executable
--name: Check for processes with the name process-name
--user: Check for processes owned by the user specified by username or uid.
The best one to use in general is probably --pidfile. The others are mainly intended to be used in case the service does not create a PID file. --exec has the disadvantage that you cannot distinguish between two different services implemented by the same program (i.e. two copies of the same service). This disadvantage would typically apply to --name also, and, additionally, --name has a chance of matching an unrelated process that happens to share the same name. --user might be useful if your service runs under a dedicated user ID which is used by nothing else. So use --pidfile if you can.
For extra safety, the options can be combined. For example, you can use --pidfile and --exec together. This way, you can identify the process using the PID file, but don't trust it if the PID found in the PID file belongs to a process that is using the wrong executable (it's a stale/invalid PID file).
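If you do the matching manually, the combined PID-file-plus-executable check could look roughly like the Python sketch below. It is not how start-stop-daemon is implemented. The function names are hypothetical, and the default lookup reads /proc/<pid>/exe, which assumes Linux.

```python
import os


def exe_from_proc(pid):
    """Return the executable path of a PID via /proc (Linux only), or None."""
    try:
        return os.readlink(f"/proc/{pid}/exe")
    except OSError:
        return None  # no such process, no /proc, or no permission


def pid_to_stop(pidfile, expected_exe, exe_of=exe_from_proc):
    """Return the PID stored in pidfile only if that process is running
    expected_exe; otherwise return None (stale or invalid PID file)."""
    try:
        with open(pidfile) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return None  # missing or malformed PID file
    if exe_of(pid) != expected_exe:
        return None  # process is gone or belongs to a different program
    return pid
```

A stop action would then send a signal only when pid_to_stop returns a PID, for example with os.kill(pid, signal.SIGTERM). The exe_of parameter exists so the check can be exercised without a real process.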
I have used the option names provided by start-stop-daemon to discuss the different possibilities, but you need not use start-stop-daemon: the discussion applies just as well if you use another utility or do the matching manually.

Resources