Which is the best scheduler for Hadoop: Oozie or cron? - Oozie

Can anyone please suggest which scheduler is best suited for Hadoop? If it is Oozie,
how is Oozie different from cron jobs?

Oozie is the best option.
Oozie Coordinator allows triggering actions when files arrive in HDFS; this is hard to implement with cron or most other schedulers.
Oozie gets callbacks from MapReduce jobs, so it knows when they finish and whether they hang, without expensive polling; few other workflow managers integrate with Hadoop this closely.
There are several benefits over crontab and similar tools; see this link:
https://prodlife.wordpress.com/2013/12/09/why-oozie/

Oozie is able to start jobs based on data availability; this is not free, since someone has to declare when the data are available.
Oozie allows you to build complex workflows with a graphical editor, using the mouse.
Oozie allows you to schedule workflow execution using the coordinator.
Oozie allows you to bundle one or more coordinators.
Using cron with Hadoop is a bad idea, but cron is still fast, reliable and well known. Most of the work you get for free with Oozie has to be hand-coded if you use cron (see the sketch below).
Using Oozie without Java means (at the time of writing) running into a long list of dependency problems.
If you are a Java programmer, Oozie is a must.
Cron is still a good choice when you are in the test/verify stage.
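For comparison, a minimal sketch of what the cron route looks like; the paths, class name and log location are hypothetical placeholders, and all the checking, retrying and chaining that Oozie handles for you would have to live in the wrapper script:

# crontab entry (hypothetical paths): run the wrapper script every weekday at 2 AM
0 2 * * 1-5 /opt/jobs/run_daily_aggregation.sh >> /var/log/daily-aggregation.log 2>&1

#!/bin/bash
# run_daily_aggregation.sh: everything Oozie gives you for free has to be hand-coded here:
# dated paths, input-availability checks, retries, alerting, chaining downstream jobs, ...
DAY=$(date +%Y%m%d)
hadoop fs -test -e "/data/input/$DAY/_SUCCESS" || exit 1   # skip if today's input isn't ready yet
hadoop jar /opt/jobs/daily-aggregation.jar com.example.DailyAggregation \
    "/data/input/$DAY" "/data/output/$DAY"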

Oozie separates the workflow from the schedule: the workflow specification describes what to run, and the coordinator specification describes when to run it. Coordinator specifications are optional, only required if you want to run a job repeatedly on a schedule. By convention you usually see the workflow specification in a file called workflow.xml and the coordinator specification in a file called coordinator.xml. The new cron-like scheduling affects these coordinator specifications. Let's take a look at a coordinator specification that will cause a workflow to be run every weekday at 2 AM.
[xml]
<coordinator-app name="weekdays-at-two-am"
                 frequency="0 2 * * 2-6"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
[/xml]
The key thing here is the frequency attribute in the coordinator-app element: a cron-like specification that instructs Oozie when to run the workflow. The values for the ${...} variables are specified in a separate properties file. The specification is only "cron-like", and you might notice one important difference: days of the week are numbered 1-7 (1 being Sunday), as opposed to the 0-6 numbering used in standard cron, so 2-6 here means Monday through Friday.
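For illustration, a minimal sketch of such a properties file; the host names, paths and dates below are hypothetical placeholders (on Hadoop 2, the jobTracker value points at the ResourceManager):

# job.properties (hypothetical values)
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8032
queueName=default
workflowAppUri=${nameNode}/user/oozie/apps/weekdays-at-two-am
start=2015-01-05T02:00Z
end=2015-12-31T02:00Z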
For more info visit: http://hortonworks.com/blog/new-in-hdp-2-more-powerful-scheduling-options-in-oozie/

Apache Oozie is built to work with YARN and HDFS.
Oozie provides many features such as data dependencies, coordinators and workflow actions.
See the Oozie documentation.
I think Oozie is the best option.
Sure, you can use cron, but you will have to put in a lot of effort to make it work with Hadoop.

Related

Oozie: Where does a custom EL function execute?

I am writing a custom EL function which will be used in Oozie workflows.
This custom function is just plain Java code; it doesn't contain any Hadoop code.
My question is: where will this EL function be executed while the workflow is running?
Will Oozie execute my EL function on the Oozie node itself, or will it push my custom Java code to one of the data nodes and execute it there?
Oozie is a workflow scheduler system for managing jobs in the Hadoop cluster itself. It is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). Source
This means that if you submit a job through Oozie, it will run on any of the available DataNodes; if your Oozie service happens to be configured on a DataNode, it can run there as well.
To check which node a job is running on, look it up in the JobTracker (Hadoop 1) or YARN (Hadoop 2), which points you to the TaskTracker or NodeManager node where the task is being processed, as in the commands below.
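For example, on Hadoop 2 these standard YARN commands help locate a job and its containers (the application ID shown is a hypothetical placeholder):

$ yarn application -list                                    # list running applications and their IDs
$ yarn logs -applicationId application_1426785123456_0001   # dump aggregated container logs, which show the nodes that ran the tasks (requires log aggregation)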
According to Apache Oozie: The Workflow Scheduler for Hadoop, page 177:
It is highly recommended that the new EL function be simple, fast and
robust. This is critical because Oozie executes the EL functions on
the Oozie server
So it will be executed on your Oozie node itself.
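For illustration, a minimal sketch of what such a custom EL function might look like; the class, method and prefix names are hypothetical, and registration is assumed to follow Oozie's standard custom EL function mechanism (an oozie.service.ELService.ext.functions.* property in oozie-site.xml):

package com.example.oozie;

// Custom Oozie EL functions are plain public static methods. Oozie invokes them
// on the Oozie server while resolving the workflow/coordinator XML, so they
// should be simple, fast and robust.
public class MyELFunctions {

    // Example: pad a numeric value to two digits, e.g. 7 -> "07".
    public static String pad2(int value) {
        return String.format("%02d", value);
    }
}

After registration it could be referenced in a workflow as, for example, ${myprefix:pad2(...)}, where "myprefix" is whatever prefix you registered.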

What's the best way to log in Oozie?

We are using Oozie workflows with an Oozie main class in the action. I am not really sure what the best logging strategy is. Should we just use log4j, since it seems to be the default strategy? Do those logs get collected on the data nodes?
Should we just use log4j since it seems like that is the default strategy?
I have not found any mention of someone using an alternative logger. It seems to be discouraged:
While Oozie can technically use any valid log4j Appender or
configurations that violate the above restrictions, certain features
related to logs may be disabled and/or not work correctly, and is thus
not advised.
Your other question:
Do those logs get collected on the data nodes?
An SO answer mentions that
the logs are distributed across your cluster, but by logging them to
the rootLogger, you should be able to see them via the job tracker (by
drilling down on the Job task attempts).
You can also inspect the Oozie job log via the CLI; for example, to print the last 10 lines:
$ oozie job -oozie oozie_URL -log job_ID | tail -n 10
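As a minimal sketch of the log4j approach inside a Java action's main class (the class name and messages are hypothetical): anything written through the logger ends up in the launcher/task logs on whichever cluster node runs the action, which is what the answer above refers to.

package com.example.oozie;

import org.apache.log4j.Logger;

// Main class invoked by an Oozie <java> action. Log output goes to the
// task/launcher logs on the node that runs the action, so it is viewed
// through the JobTracker / YARN UI rather than on the Oozie server.
public class DailyAggregationMain {

    private static final Logger LOG = Logger.getLogger(DailyAggregationMain.class);

    public static void main(String[] args) {
        LOG.info("Starting daily aggregation with args: " + java.util.Arrays.toString(args));
        // ... actual job logic here ...
        LOG.info("Finished daily aggregation");
    }
}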

How to reschedule a coordinator job in Oozie without restarting the job?

When I changed the start time of a coordinator job in job.properties in Oozie, the job did not pick up the changed time; instead it keeps running at the old scheduled time.
Old job.properties:
startMinute=08
startTime=${startDate}T${startHour}:${startMinute}Z
New job.properties:
startMinute=07
startTime=${startDate}T${startHour}:${startMinute}Z
The job is not running at the changed time (the 7th minute); it is still running at the 8th minute of every hour.
Can you please let me know how I can make the job pick up the updated properties (the changed timing) without restarting or killing the job?
You can't really change the timing of the coordinator via any method provided by Oozie (v3.3.2). When you submit a job, its properties are stored in the database, whereas the actual workflow is in HDFS.
Every time the coordinator executes, the workflow must be present at the path specified in the properties at job submission, but the properties file itself is no longer needed. In other words, the properties file does not come into the picture after the job has been submitted.
One hack is to update the time directly in the database with a SQL query, but I am not sure about the implications; the property might become inconsistent across the database.
You have to kill the job and resubmit a new one.
Note: Oozie does provide a way to change the concurrency, end time and pause time of a running coordinator, as specified in the official docs; see the example below.
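A sketch of that documented -change operation via the Oozie CLI (the server URL, job ID and values are hypothetical placeholders); note that the start time is not among the properties it can modify:

$ oozie job -oozie http://oozie-host:11000/oozie -change 0000001-140321155112907-oozie-oozi-C \
      -value 'endtime=2015-12-01T05:00Z;concurrency=2;pausetime=2015-10-01T05:00Z'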

Apache Mesos Workflows - Event-Driven Scheduler

We are currently using Apache Mesos with Marathon and Chronos to schedule long-running and batch processes.
It would be great if we could create more complex workflows like with Oozie, say for example kicking off a job when a file appears in a location, or when a certain application completes, or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems to be no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.
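A minimal sketch of registering a dependent job through that endpoint (the host, port, job names, command and owner are hypothetical placeholders):

$ curl -L -H 'Content-Type: application/json' -X POST \
      -d '{"name": "process-report", "command": "/opt/jobs/process_report.sh", "owner": "ops@example.com", "async": false, "epsilon": "PT30M", "parents": ["ingest-raw-data"]}' \
      http://chronos-host:4400/scheduler/dependency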

Strict coordinator job ordering on Oozie

I have a coordinator on oozie that runs a series of tasks, each of which depends on the output of the last.
Each task outputs a dated folder and looks for the output of its predecessor using
${coord:latest(0)}
This all worked fine on my dev cluster when nothing else was running: every 5 minutes Oozie would queue up another job, and within those 5 minutes the previous job had run, so when the new job was set up it would see the directory it needed.
I run into problems on the production cluster: the jobs get submitted but sit in a queue and don't run for a while, yet every 5 minutes Oozie still queues up another one. In its initialization stage each new job is assigned its 'previous' folder, which hasn't been created yet because its predecessor hasn't run, so the 'latest' function gives it the same input as the previous job. I then end up with 10 jobs all taking the same input...
What I need is a way of strictly preventing the next job in a coordinator sequence from even being created until its predecessor has finished running.
Is there a way this can be done?
Thanks for reading
This is exactly the use case that Oozie is designed to solve: Oozie will wait for all data dependencies before launching.
Please take a look at the following configuration in your coordinator.xml:
<datasets>
    <dataset name="my_data" frequency="${coord:days(1)}" initial-instance="2013-01-27T00:00Z">
        <uri-template>YOUR_DATA/${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
    ...
</datasets>
<input-events>
    <data-in name="my_data" dataset="my_data">
        <instance>${coord:current(-1)}</instance>
    </data-in>
</input-events>
<output-events>
    <data-out name="my_data" dataset="my_data">
        <instance>${coord:current(0)}</instance>
    </data-out>
</output-events>
the "coord:current(-1)" in input-events means the previous output. It will interpret the dataset URI teamplate to "yesterday", and Oozie will check whether the data exist in HDFS by checking a success flag, which by default is an empty file named "_SUCCESS", right under the output directory. Oozie will keep waiting this flag before launching the current workflow.
btw, you can also set
<coordinator-app name="my_coordinator" frequency="${coord:days(1)}" start="${start_time}" end="${end_time}" ...>
to define the start and end time of a coordinator job, so you can catch up on backlog data.
