Create Oozie coordinator without end date with Hue - oozie

Can I create an infinite Oozie coordinator that doesn't expire (i.e. has no end date) with Hue?
Please, help!

No. You cannot create a coordinator that will never expire. Here is the Oozie Coordinator XSD reference:
<xs:attribute name="start" type="xs:string" use="required"/>
<xs:attribute name="end" type="xs:string" use="required"/>
Like start, end is also a required attribute.
You can create a coordinator that expires after a long time, such as 100 years. Technically, that is practically the same as never expiring.
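For example, a minimal sketch of such a coordinator (the app name, dates, and variables here are placeholder assumptions, not from the original answer); an end date a century out is effectively "never":

```xml
<coordinator-app name="never-ending-coord" frequency="${coord:days(1)}"
                 start="2016-01-01T00:00Z" end="2116-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
        </workflow>
    </action>
</coordinator-app>
```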

Related

Oozie trigger-file based coordinator. trigger file is not from hourly folders

I'm trying to create a coordinator with a dependency on a file, a trigger file.
The frequency of my coordinator is 5 minutes. The timeout is 4 minutes.
My goal is that the coordinator should execute the workflow only if the specified trigger file has been created. If the trigger file has not been created, the coordinator should wait for it and time out at the end of the 4th minute. The workflow deletes the trigger file once it is triggered by the coordinator. The trigger file arrives whenever the source data is updated, so we have to run the workflow again.
The trigger file may arrive multiple times a day, so I'm setting the coordinator frequency to 5 minutes. I have tried the following code:
<coordinator-app name="transform_data_if_exists_coord" frequency="${freqMin}" start="${startDate}" end="${endDate}" timezone="${timeZone}" xmlns="uri:oozie:coordinator:0.1">
<controls>
<timeout>${timeOutMin}</timeout>
<concurrency>${concurrencyCount}</concurrency>
</controls>
<datasets>
<dataset name="input1" frequency="${coord:minutes(${freqMin})}" initial-instance="${startDate}" timezone="${timeZone}">
<uri-template>maprfs:////idn/home/deploy/inputdata/file</uri-template>
<done-flag>trigger</done-flag>
</dataset>
</datasets>
<action>
<workflow>
<app-path>/idn/home/deploy/triggerEmail/triggerEmail_wf.xml</app-path>
</workflow>
</action>
</coordinator-app>
With the following properties:
startDate=2016-01-26T18:20Z
endDate=2017-01-17T06:00Z
timeZone=UTC
freqMin=5
timeOutMin=4
concurrencyCount=1
You need to add an input-events section and pass the property WaitForThisInputData to the action:
<input-events>
<data-in name="check_for_input" dataset="input1">
<instance>${startTime2}</instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>/idn/home/deploy/triggerEmail/triggerEmail_wf.xml</app-path>
<configuration>
<property><name>WaitForThisInputData</name><value>${coord:dataIn('check_for_input')}</value></property>
</configuration>
</workflow>
</action>
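As an aside (an assumption on my part, not part of the original answer): pinning the instance to the fixed ${startTime2} makes every materialization check the same dataset instance. A common pattern is to reference the instance belonging to the current materialization instead:

```xml
<input-events>
    <data-in name="check_for_input" dataset="input1">
        <!-- coord:current(0) resolves to the dataset instance
             for this coordinator materialization -->
        <instance>${coord:current(0)}</instance>
    </data-in>
</input-events>
```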
Though my workflow now gets kicked off only when the trigger file is present,
sometimes two workflows are kicked off from the same parent coordinator instance.
That is, if the coordinator kicks off a workflow at its 25th instance, the workflow ID contains the parent
0001212-xxxxxxx-oozie-mapr-C#25
but I see two workflows kicked off with the same parent coordinator instance, which is not ideal. Both jobs are kicked off at the same timestamp.
Am I missing a property? I've set the coordinator's concurrency to 1.
This issue is seen when the trigger file arrives after a gap of several hours.

How to force coordinator action materialization at specific frequency?

I would like to know if it is possible, and how, to force a coordinator to materialize or instantiate a workflow at regular intervals even if previously instantiated workflows are not done yet.
Let me explain:
I have a simple coordinator looking like this:
<coordinator-app name="myApp" frequency="${coord:hours(3)}" start="2015-01-01T00:00Z" end="2016-01-01T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
<action>
<workflow>
<app-path>${myPath}/workflow.xml</app-path>
</workflow>
</action>
</coordinator-app>
The frequency is set to 3 hours. Every 3 hours, I expect the coordinator to "materialize" a new workflow instance/job.
Here is my problem: when the workflow execution lasts more than 3 hours, the coordinator doesn't materialize a new workflow instance but waits for the current running workflow to finish first. It will then instantiate the next workflow. Coordinator-launched workflows are queuing if they last more than the frequency.
How can I make the coordinator start a new job every 3 hours no matter what?
Thank you
You should use the concurrency property. By default it is 1, which is why your workflows queue. Set it as high as you find reasonable.
<coordinator-app name="[NAME]" frequency="[FREQUENCY]"
start="[DATETIME]" end="[DATETIME]" timezone="[TIMEZONE]"
xmlns="uri:oozie:coordinator:0.1">
<controls>
<concurrency>[CONCURRENCY]</concurrency>
</controls>
From docs:
Concurrency: A coordinator job can specify the concurrency for its coordinator actions, this is, how many coordinator actions are allowed to run concurrently ( RUNNING status) before the coordinator engine starts throttling them.
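Applied to the 3-hour coordinator above, a sketch might look like this (the value 8 is an arbitrary assumption; the optional execution control, FIFO by default, sets the order in which queued actions start):

```xml
<coordinator-app name="myApp" frequency="${coord:hours(3)}"
                 start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <controls>
        <!-- allow up to 8 coordinator actions in RUNNING state at once -->
        <concurrency>8</concurrency>
        <!-- FIFO (the default) starts the oldest materialized action first -->
        <execution>FIFO</execution>
    </controls>
    <action>
        <workflow>
            <app-path>${myPath}/workflow.xml</app-path>
        </workflow>
    </action>
</coordinator-app>
```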

Which is the best scheduler for HADOOP. oozie or cron?

Can anyone please suggest which scheduler is best suited for Hadoop? If it is Oozie,
how is Oozie different from cron jobs?
Oozie is the best option.
Oozie Coordinator allows triggering actions when files arrive at HDFS, which would be challenging to implement anywhere else.
Oozie gets callbacks from MapReduce jobs, so it knows when they finish and whether they hang, without expensive polling; few other workflow managers can do this.
Oozie has several benefits over crontab or any other scheduler; here are some points, and a link:
https://prodlife.wordpress.com/2013/12/09/why-oozie/
Oozie is able to start jobs on data availability; this is not free, since someone has to declare when the data is available.
Oozie allows you to build complex workflows using the mouse.
Oozie allows you to schedule workflow execution using the coordinator.
Oozie allows you to bundle one or more coordinators.
Using cron on Hadoop is a bad idea, but it is still fast, reliable, and well known. Most of the work that comes for free with Oozie has to be coded yourself if you are going to use cron.
Using Oozie without Java means (at the current date) facing a long list of dependency problems.
If you are a Java programmer, Oozie is a must.
Cron is still a good choice when you are in the test/verify stage.
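For comparison, the cron equivalent of a scheduled Hadoop job is a single crontab line (the jar path and class below are hypothetical placeholders); retries, data dependencies, and completion callbacks would all have to be scripted by hand:

```
# crontab entry: run a Hadoop job every weekday at 2 AM
# (standard cron numbers weekdays 0-6, Sunday = 0)
0 2 * * 1-5 /usr/bin/hadoop jar /opt/jobs/myjob.jar com.example.MyJob >> /var/log/myjob.log 2>&1
```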
Oozie separates specifications for workflow and schedule into a workflow specification and a coordinator specification, respectively. Coordinator specifications are optional, only required if you want to run a job repeatedly on a schedule. By convention you usually see workflow specifications in a file called workflow.xml and a coordinator specification in a file called coordinator.xml. The new cron-like scheduling affects these coordinator specifications. Let’s take a look at a coordinator specification that will cause a workflow to be run every weekday at 2 AM.
<coordinator-app name="weekdays-at-two-am"
frequency="0 2 * * 2-6"
start="${start}" end="${end}" timezone="UTC"
xmlns="uri:oozie:coordinator:0.2">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
The key thing here is the frequency attribute on the coordinator-app element: a cron-like expression that tells Oozie when to run the workflow. The values for the ${start}, ${end}, and ${workflowAppUri} variables are supplied in a separate properties file. The syntax is only "cron-like"; one important difference is that days of the week are numbered 1-7 (1 being Sunday), as opposed to the 0-6 numbering used in standard cron.
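A matching properties file might look like the following (every date, host, and path here is a placeholder assumption for illustration):

```
start=2016-01-01T00:00Z
end=2017-01-01T00:00Z
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
workflowAppUri=${nameNode}/user/oozie/apps/weekdays-at-two-am
oozie.coord.application.path=${nameNode}/user/oozie/apps/weekdays-at-two-am
```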
For more info visit: http://hortonworks.com/blog/new-in-hdp-2-more-powerful-scheduling-options-in-oozie/
Apache Oozie is built to work with YARN and HDFS.
Oozie provides many features such as data dependencies, coordinators, and workflow actions.
Oozie documentation
I think Oozie is the best option.
Sure, you can use cron, but you will have to put in a lot of effort to make it work with Hadoop.

How can I handle published message from my input queue as well as publish it to remote endpoint?

I currently have Rebus configured to direct all messages in the MyMessages assembly to the process manager's input queue. Suppose that in the MyMessages assembly I have a message called SomethingHappened, which will be used to trigger other actions in process managers (e.g. DoSomethingElse). However, I also want to subscribe to SomethingHappened locally (in the same queue) to update the database. So the workflow will look like:
User trigger the action DoSomething in the UI
The handler of DoSomething updates the object state and publish events SomethingHappened internally (in memory collection of uncommitted events)
SomethingHappened events get published by Rebus to the queue by going through all uncommitted events in the object
Handle SomethingHappened locally to update database
Handle SomethingHappened remotely by saga to trigger SomethingElseHappened
Is this possible to be configured in rebus?
The rebus configuration I currently have:
<rebus inputQueue="input" errorQueue="error" workers="1" maxRetries="5">
<endpoints>
<add messages="MyMessages" endpoint="processManagers.input"/>
</endpoints>
</rebus>
Thank You
Yin
My mistake. I should change the endpoint in the configuration to input. And then never, never subscribe to commands. :)
<rebus inputQueue="input" errorQueue="error" workers="1" maxRetries="5">
<endpoints>
<add messages="MyMessages" endpoint="input"/>
</endpoints>
</rebus>

Cannot kill or suspend oozie coordinator job

I submitted an Oozie coordinator job as the user "runner". When I try to either kill or suspend it, I get the following error message:
[runner#hadooptools ~]$ oozie job -oozie http://localhost:11000/oozie -kill 0000005-140722025226945-oozie-oozi-C
Error: E0509 : E0509: User [?] not authorized for Coord job [0000005-140722025226945-oozie-oozi-C]
From the logs on oozie server I see following message:
2014-07-25 03:10:07,324 INFO oozieaudit:539 - USER [runner], GROUP [null], APP [cron-coord], JOBID [0000005-140722025226945-oozie-oozi-C], OPERATION [start], PARAMETER [null], STATUS [SUCCESS], HTTPCODE [200], ERRORCODE [null], ERRORMESSAGE [null]
From time to time, even the user under whom I issue the command is not logged correctly.
I am using CentOS 6.3 with Oozie client build version 4.0.0.2.0.6.0-101 and Oozie server build version 4.0.0.2.0.6.0-101.
I am not even able to stop it as the user oozie, who runs the server. As the user who submitted the job, I cannot suspend, kill, etc.; I can only perform submit, run (which starts the flow), or info.
Any hints or tricks, or did I misconfigure something obvious?
UPDATE:
Security settings for the instance I am using:
<property>
<name>oozie.authentication.type</name>
<value>simple</value>
</property>
<property>
<name>oozie.authentication.simple.anonymous.allowed</name>
<value>true</value>
</property>
My conf/adminusers.txt contains:
# Admin Users, one user by line
runner
Hadoop core-site.xml
<property>
<name>hadoop.security.authentication</name>
<value>simple</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>users</value>
</property>
Where runner is a member of the users group. According to the Oozie documentation:
Oozie has a basic authorization model:
Users have read access to all jobs
Users have write access to their own jobs
Users have write access to jobs based on an Access Control List (a list of users and groups)
Users have read access to admin operations
Admin users have write access to all jobs
Admin users have write access to admin operations
Did I overlook something in the configuration?
Do I need to specify/configure something like this:
Pseudo/simple authentication requires the user to specify the user name on the request, this is done by the PseudoAuthenticator class by injecting the user.name parameter in the query string of all requests. The user.name parameter value is taken from the client process Java System property user.name .
Old question, but hey, I hit the same problem. It seems related to https://issues.apache.org/jira/browse/OOZIE-800
Just removing ~/.oozie-auth-token before issuing the oozie command solved it for me.
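In shell terms, that workaround looks like this (the Oozie URL and job ID are the ones from the question; adjust them to your setup):

```
# delete the stale Oozie auth token, then retry the kill
rm -f ~/.oozie-auth-token
oozie job -oozie http://localhost:11000/oozie -kill 0000005-140722025226945-oozie-oozi-C
```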
Temporarily resolved by disabling the authorization model.
The following setting disables it; after that, everything worked as expected:
<property>
<name>oozie.service.AuthorizationService.security.enabled</name>
<value>false</value>
<description>
Specifies whether security (user name/admin role) is enabled or not.
If disabled any user can manage Oozie system and manage any job.
</description>
</property>
I will look deeper into how to solve this correctly, but as a temporary fix, or for development, this works fine.
