Apache Oozie under Hue - oozie

I need step-by-step documentation for setting up a workflow scheduler with Oozie under Hue, including the configuration or parameterization steps.
I have a school project, "Workflows for Big Data", in which I am asked to use Oozie to schedule tasks in Hadoop, which I do not know at all. After searching the internet (the Apache Oozie site, the Hue site and their documentation) and some books, I have not found a satisfactory result.
I have defined some jobs in XML files, but whenever I try to run them, Oozie systematically kills the job. I'm new to the forum and to big data.
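For orientation, here is a minimal sketch of the two files a workflow submission needs: a workflow.xml with a single action and a job.properties that points at it. The workflow name demo-wf, the make-dir action, the HDFS paths and the NameNode address are placeholders chosen for illustration, not values from the question.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Placeholder workflow: one HDFS "fs" action that creates a directory.
# ${nameNode} is resolved by Oozie from job.properties at submission time.
WORKFLOW_XML = """<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
    <start to="make-dir"/>
    <action name="make-dir">
        <fs>
            <mkdir path="${nameNode}/tmp/demo-wf-output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""

# Placeholder properties: adjust the NameNode address and path to your cluster.
JOB_PROPERTIES = """nameNode=hdfs://localhost:8020
oozie.wf.application.path=${nameNode}/user/${user.name}/demo-wf
"""


def write_workflow_files(target_dir):
    """Write workflow.xml and job.properties into target_dir and return their paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Fail early on malformed XML (well-formedness only, not the Oozie schema).
    ET.fromstring(WORKFLOW_XML)
    workflow_path = target / "workflow.xml"
    properties_path = target / "job.properties"
    workflow_path.write_text(WORKFLOW_XML)
    properties_path.write_text(JOB_PROPERTIES)
    return workflow_path, properties_path
```

Calling write_workflow_files("demo-wf") produces both files locally. The workflow.xml then has to be uploaded to the HDFS directory named by oozie.wf.application.path before the job is submitted with oozie job -config job.properties -run. When a job is killed, oozie job -info and oozie job -log with the job id are usually the first places to look for the reason.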

Related

Job Sensors in Databricks Workflows

At the moment we schedule our Databricks notebooks using Airflow. Due to dependencies between projects, there are dependencies between DAGs. Some DAGs wait until a task in a previous DAG is finished before starting (by using sensors).
We are now looking to use Databricks DBX. It is still new for us, but it seems that DBX' main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX. My question is now, is it possible to add dependencies between Databricks jobs? Can we create 2 different jobs using DBX, and make the second job wait until the first one is completed?
I am aware that I can have dependencies between tasks in one job, but in our case it is not possible to have only one job with all the tasks.
I was thinking about adding a notebook/Python script before the wheel with ETL logic. This notebook would then check whether the previous job has finished. Once this is the case, the task with the wheel would be executed. Does this make sense, or are there better ways? Is something like the ExternalTaskSensor in Airflow available within Databricks workflows?
Or is there a good way to use DBX without DB workflows?
author of dbx here.
TL;DR - dbx is not opinionated in terms of the orchestrator choice.
It is still new for us, but it seems that DBX' main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX.
The short answer is yes, but it's done at the task level (read more here on the difference between workflow and task).
Another approach would be the following - if you still need (or want) to use Airflow, you can do it in the following way:
Deploy and update your jobs from your CI/CD pipeline with dbx deploy commands.
In Airflow, use the Databricks Operator to launch the job (either by name or by id).
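For the sensor-style check described in the question (a first task that waits for another job), a rough sketch could poll the Databricks Jobs API for the latest run of the upstream job. This is a workaround under stated assumptions, not a built-in Databricks feature: the function names, the polling interval and the timeout are made up for illustration, and host, token and job_id must be supplied by you.

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_latest_run(host, token, job_id):
    """Return the most recent run of a job via Jobs API 2.1, or None if it never ran."""
    query = urllib.parse.urlencode({"job_id": job_id, "limit": 1})
    request = urllib.request.Request(
        f"{host}/api/2.1/jobs/runs/list?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        runs = json.load(response).get("runs", [])
    return runs[0] if runs else None


def run_succeeded(run):
    """True only for a run that has terminated with result SUCCESS."""
    state = (run or {}).get("state", {})
    return (
        state.get("life_cycle_state") == "TERMINATED"
        and state.get("result_state") == "SUCCESS"
    )


def wait_for_upstream(fetch, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll fetch() until the latest run has succeeded; raise TimeoutError otherwise."""
    waited = 0
    while True:
        if run_succeeded(fetch()):
            return
        if waited >= timeout_seconds:
            raise TimeoutError("upstream job did not succeed in time")
        sleep(poll_seconds)
        waited += poll_seconds
```

In the first task of the downstream job this could be called as wait_for_upstream(lambda: fetch_latest_run(host, token, job_id)). Note that the sketch only looks at the latest run: a real check would probably also compare the run's start_time with the expected schedule window, so that yesterday's success is not mistaken for today's, and might fail fast on a failed run instead of waiting for the timeout.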

Editing Oozie Workflow after submitting through Command Line

After running an oozie workflow using the command line I am unable to edit it using the Hue Workflow editor or even find it in the list of workflows.
I have an R script that generates the workflow.xml and job.properties, and will run the commands necessary to run the scripts:
workflow.path <- "workflow.xml" # Hard coded for the sake of this example
system2("hadoop", args = c("fs -put -f ", workflow.path, "/User/service/Test/" ))
system("oozie job --oozie http://localhost:11000/oozie -config job.properties -run")
Moving the workflow into HDFS works fine, and I have verified it is a valid workflow using oozie. Running the job also works like a charm, however if I open up Hue, and navigate to the Workflow and find it, I cannot edit it, only rerun it.
Some background on what I am trying to do: We have a large number of automated workflows and we are always adding more. They all follow the same pattern, so automating the creation of the coordinators and workflows is simple. Sometimes these workflows have to be modified by people, and they need to be able to use the web interface.
Any help would be appreciated.
Indeed, only workflows created via the Drag&Drop Editor can be edited.
Workflows submitted via the CLI can only be visualized.

Oozie editor in HUE doesn't show workflows I submitted with the oozie command

I submitted a workflow and ran the job with the oozie command:
oozie job -oozie http://node1:11000/oozie -config job.properties -submit
oozie job -oozie http://node1:11000/oozie -start [job_id]
It worked well.
I then wanted to edit the workflow in the Oozie editor in HUE but couldn't find it. What should I do to make the workflow show up in the Oozie editor?
CDH version: 5.9
I would not recommend using Hue for editing workflows. Because it relies on a set of metadata that it stores in its own database, it will not let you change a workflow submitted from outside.
Hue is a pretty nice tool for prototyping, for monitoring workflow/coordinator progress, for checking the logs and for getting a first look when investigating an issue. You can import your XML and allow editing via the GUI - http://blog.cloudera.com/blog/2013/03/how-to-import-a-pre-existing-oozie-workflow-into-hue/ , but this is hard to manage and maintain in production-like environments, for example when running multiple instances of the workflow with various parameters.
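To illustrate the "multiple instances with various parameters" case outside Hue, one option is to generate a properties file per instance from a shared base and submit each with the CLI. The sketch below assumes simple key/value pairs that need no Java-properties escaping; the function names and any property names you pass in (such as a region parameter) are hypothetical.

```python
from pathlib import Path


def render_properties(base, overrides):
    """Merge base and per-run properties into job.properties text (overrides win)."""
    merged = {**base, **overrides}
    return "".join(f"{key}={value}\n" for key, value in sorted(merged.items()))


def write_instances(base, runs, out_dir):
    """Write one <run-name>.properties file per entry in runs and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, overrides in runs.items():
        path = out / f"{name}.properties"
        path.write_text(render_properties(base, overrides))
        paths.append(path)
    return paths
```

Each generated file can then be passed to oozie job -oozie http://node1:11000/oozie -config <file> -run, so the parameter sets live in version control rather than in Hue's database.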

How to trigger an oozie workflow job with Bamboo?

I am new to Bamboo. I know in general how to trigger an Oozie workflow job in a CDH environment. Could someone please suggest some good documentation that describes how to do this from Bamboo?
In Bamboo I have just created a plan that builds the code from my repository each time I check in. Now I need to know: how can I trigger a workflow job from Bamboo?
I understand that this should be some kind of command that Bamboo executes. Please suggest.
You could use the Oozie command line interface from a script execution step.
Install the Oozie client on the machine running Bamboo (or on any other machine that you reach over ssh). On CDH cluster machines it should be preinstalled.
Create a Bamboo script task (Details: Bamboo Script)
Run the start-job command in the script task, e.g. oozie job -oozie http://localhost:8080/oozie -start 14-20090525161321-oozie-joe. See details on CLI usage here
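The script task could also call a small wrapper instead of the bare command, so that a failed submission fails the build. The following is a sketch under assumptions: it uses -run (submit and start in one step) with a job.properties file rather than -start with an existing job id, it expects the oozie client on the PATH, and the function names are made up for illustration.

```python
import re
import subprocess


def run_command(oozie_url, properties_file):
    """Build the argument list for submitting and starting a job in one step."""
    return ["oozie", "job", "-oozie", oozie_url, "-config", properties_file, "-run"]


def parse_job_id(cli_output):
    """Extract the id from the 'job: <id>' line the Oozie CLI prints on submit."""
    match = re.search(r"^job:\s*(\S+)", cli_output, re.MULTILINE)
    if match is None:
        raise ValueError(f"no job id in output: {cli_output!r}")
    return match.group(1)


def trigger(oozie_url, properties_file, run=subprocess.run):
    """Run the Oozie CLI and return the new job id; raise if the CLI fails."""
    result = run(
        run_command(oozie_url, properties_file),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"oozie CLI failed: {result.stderr.strip()}")
    return parse_job_id(result.stdout)
```

A script task could then run something like print(trigger("http://localhost:8080/oozie", "job.properties")). An uncaught exception makes the Python process exit with a non-zero code, which Bamboo treats as a failed task.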

Can oozie control jobs outside of Hadoop?

From documentation, it isn't very clear whether oozie can schedule and control jobs outside of Hadoop? Can someone shed some light on this? If not, is there any open source based workflow engine which can do that?
Consider using Chronos (from Airbnb), an advanced version of cron with a UI, built on top of Mesos. airbnb.github.com/chronos/
Cheers.
I believe not. Oozie itself does not have a resource management policy; all it does is submit jobs to Hadoop's job tracker at the right time. Besides, for each Oozie workflow there is one launcher job that is responsible for submitting the real jobs in the workflow to Hadoop, and the launcher job is itself a Hadoop job. So, I think for versions earlier than Oozie 3.2, the answer should be no.
You might consider trying Azkaban by LinkedIn. It was built specifically for Hadoop, but Unix commands can be specified in an Azkaban job file, so you can develop a workflow for any application(s) that can be run from the command line.
I've been working on a new workflow engine called Soop: https://github.com/radixCSgeek/soop. It is very lightweight and simple to set up, and it runs using a cron-like syntax. It can run any Java POJO as well as shell processes, so you can kick off a bash script or whatever.

Resources