I just deployed an Oozie job. Now when I go to the Oozie web UI ... I just cannot see the job I deployed.
Is there a command line tool which will allow me to do two things:
List all the jobs which are deployed (not just running, active, or killed)... but deployed, like an inventory of all jobs.
Execute a job from the command line (on demand, not based on a schedule).
As already mentioned in one of the comments, oozie itself is a command line tool.
Therefore, to answer both of your questions:
List all jobs
For listing all workflow jobs, use either of the following commands (they are equivalent, since wf is the default job type):
oozie jobs
oozie jobs -jobtype wf
For listing all coordinator jobs, use the following command from the console:
oozie jobs -jobtype coordinator
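If you need a fuller inventory, the same command accepts a few extra flags (these are standard Oozie CLI options as far as I know; the server URL below is a placeholder, and -oozie can be omitted if the OOZIE_URL environment variable is set):
# show up to 200 jobs instead of the default page size
oozie jobs -oozie http://oozie-url:11000/oozie -len 200
# list bundle jobs as well
oozie jobs -oozie http://oozie-url:11000/oozie -jobtype bundle
# filter by status, e.g. only killed workflows
oozie jobs -oozie http://oozie-url:11000/oozie -filter status=KILLED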
Execute a job from command line
oozie job --oozie http://oozie-url:11000/oozie -config job.properties -run
Mind you, if you want to keep starting jobs on demand, you either have to run the above command manually (and only for workflow jobs, not for coordinator jobs, since coordinator jobs run according to the schedule you define) or put it in a shell script so that it is triggered under certain conditions, as sketched below.
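As a rough sketch of that shell-script idea (the HDFS trigger path and server URL are made up; adapt them to your setup):
#!/bin/bash
# Hypothetical on-demand trigger: start the workflow only when a flag file
# appears in HDFS, using the same job.properties as above.
TRIGGER=/data/incoming/_READY

if hdfs dfs -test -e "$TRIGGER"; then
  oozie job -oozie http://oozie-url:11000/oozie -config job.properties -run
  hdfs dfs -rm "$TRIGGER"   # clear the flag so the job isn't resubmitted next time
fi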
For more info check this link - Oozie_Command_Line_Usage
Related
I need to restart Airflow. I want to make sure I do it when it's idle, so that I don't interrupt a job by restarting the worker component of Airflow.
How do I see what DAGs are running?
I don't see anything in the UI that would list currently running DAGs.
I don't see any command in the airflow CLI to list currently running DAGs.
I found airflow shell that lets me connect to the DB, but I don't know enough about Airflow internals to know where to look to see what's running.
You can also query the database to get all running tasks at once:
select * from task_instance where state='running'
You can use the CLI command airflow jobs check, which returns "No alive jobs found." in the event no jobs are running.
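If you want to script the restart around that, here is a rough sketch (it assumes the message above, and that the worker runs under a systemd unit named airflow-worker, which is only an example):
# Poll until Airflow reports no alive jobs, then restart the worker.
# 2>&1 catches the message whether it is printed on stdout or stderr.
until airflow jobs check 2>&1 | grep -q "No alive jobs found."; do
  sleep 30
done
sudo systemctl restart airflow-worker   # replace with however you manage the worker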
I found it... it's in the UI, on the DAGs page: it's the second circle under "Recent Tasks".
Is there any Oozie action that can be used in workflow.xml to purge the logs generated by Oozie that are older than 2 days from the Oozie job execution?
You can use:
log4j.appender.oozie.RollingPolicy.FileNamePattern=${log4j.appender.oozie.File}-%d{yyyy-MM-dd-HH}
log4j.appender.oozie.RollingPolicy.MaxHistory=720
These settings define how long you keep your logs: older logs are gzipped and, after the retention period, deleted. With the hourly file pattern above, MaxHistory is counted in hourly files, so 720 keeps roughly 30 days; setting it to 48 would keep roughly the 2 days you are asking for.
I am writing a custom EL function which will be used in oozie workflows.
This custom function is just plain Java code; it doesn't contain any Hadoop code.
My question is where will this EL function be executed at the time the workflow is running?
Will it execute my EL function on the Oozie node itself, or will it push my custom Java code to one of the data nodes and execute it there?
Oozie is a workflow scheduler system to manage jobs in the Hadoop cluster itself. It is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts). Source
This means that if you submit a job in Oozie, the job itself will run on any of the available DataNodes; if your Oozie service is configured on a DataNode, it can run there as well.
To check which node a job is running on, look it up in the JobTracker (Hadoop 1) or YARN (Hadoop 2), which redirects you to the TaskTracker/NodeManager node where the job is being processed.
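For example, on a YARN cluster you can first ask Oozie for an action's external (YARN) application ID and then ask YARN where it ran; the URL and IDs below are placeholders:
# list the workflow's actions, including each action's External ID (the YARN application ID)
oozie job -oozie http://oozie-url:11000/oozie -info <workflow-job-id>
# ask YARN for that application's details, including the host it ran on
yarn application -status <application-id>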
According to Apache Oozie: The Workflow Scheduler for Hadoop, page 177:
It is highly recommended that the new EL function be simple, fast and robust. This is critical because Oozie executes the EL functions on the Oozie server.
So it will be executed on your Oozie server itself.
We are currently using Apache Mesos with Marathon and Chronos to schedule long-running and batch processes.
It would be great if we could create more complex workflows, as with Oozie; say, for example, kicking off a job when a file appears in a location, or when a certain application completes or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems to be no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.
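For reference, a dependent job is just a JSON document POSTed to that endpoint, roughly along these lines (the host, port, job names and owner are placeholders; 4400 is only the default Chronos port):
# create a job that runs only after its parent job has completed at least once
curl -X POST -H "Content-Type: application/json" \
     http://chronos-host:4400/scheduler/dependency \
     -d '{
           "name": "process-data",
           "command": "run_processing.sh",
           "owner": "you@example.com",
           "parents": ["ingest-data"]
         }'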
I have a coordinator on oozie that runs a series of tasks, each of which depends on the output of the last.
Each task outputs a dated folder and looks for the output of its predecessor using
${coord:latest(0)}
This all worked fine on my dev cluster when nothing else was running; every 5 minutes Oozie would queue up another job, and within that 5 minutes the previous job had run, so when the new job was set up it would see the directory it needed.
I run into problems on the production cluster: the jobs get submitted but are put in a queue and don't run for a while, yet every 5 minutes Oozie still queues up another one. At its initialization stage each new job is assigned its 'previous' folder, which hasn't been created yet because its predecessor hasn't run, so the 'latest' function gives it the same input as the previous job. I then end up with 10 jobs all taking the same input...
What I need is a way of strictly preventing the next job in a coordinator sequence from even being created until its predecessor has finished running.
Is there a way this can be done?
Thanks for reading
This is the exact use case that Oozie was designed to solve: Oozie will wait for all data dependencies before launching.
Please look at the following configuration in your coordinator.xml:
<datasets>
<dataset name="my_data" frequency="${coord:days(1)}" initial-instance="2013-01-27T00:00Z">
<uri-template>YOUR_DATA/${YEAR}${MONTH}${DAY}</uri-template>
</dataset>
...
</datasets>
<input-events>
<data-in name="my_data" dataset="my_data">
<instance>${coord:current(-1)}</instance>
</data-in>
</input-events>
<output-events>
<data-out name="my_data" dataset="my_data">
<instance>${coord:current(0)}</instance>
</data-out>
</output-events>
the "coord:current(-1)" in input-events means the previous output. It will interpret the dataset URI teamplate to "yesterday", and Oozie will check whether the data exist in HDFS by checking a success flag, which by default is an empty file named "_SUCCESS", right under the output directory. Oozie will keep waiting this flag before launching the current workflow.
btw, you can also set
<coordinator-app name="my_coordinator" frequency="${coord:days(1)}" start="${start_time}" end="${end_time}" ...>
to define the start time and end time of a coordinator job, so you can catch up on backlog data.
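For completeness, a minimal sketch of how those two parameters could be supplied and the coordinator submitted (the dates, paths and URL are placeholders):
# job.properties (excerpt)
oozie.coord.application.path=hdfs://namenode:8020/user/me/my_coordinator
start_time=2013-01-27T00:00Z
end_time=2013-02-27T00:00Z

# submit the coordinator
oozie job -oozie http://oozie-url:11000/oozie -config job.properties -run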