Get oozie job information in oozie workflow by REST - oozie

How can I find jobs with the parent id is null? I tried 3 methods but none of them worked for me.
/oozie/v1/jobs?jobtype=wf&filter=parent_id=%00 NOT WORKING
/oozie/v1/jobs?jobtype=wf&filter=parent_id=null NOT WORKING
/oozie/v1/jobs?jobtype=wf&external-id=null NOT WORKING

Oozie does not support filtering jobs for parent_id yet. See the current possibilities for filtering options in the Oozie documentation.
There are plans (since 2014) to add support, see the open jira: OOZIE-1779

Related

Airflow DAG Versioning

Is DAG versioning a thing ? I can't find much on the subject with a few Google searches. I would like to look at the DAGs screen in Airflow and be sure of what DAG code is in the wild.
The simplest solution would be to include a version number as part of the dag_id, but I would appreciate knowing if anyone has better, alternative solution. Tags would work too and migjht look good in the UI - they are designed for for filtering though, I'm not sure if there would be undesirable side-effects.
As the author of the DAG Versioning AIP, I can say that this work has been deferred post 2.0 mainly to support end-to-end DAG Versioning.
Originally, we (Airflow Core Committers) were planning to have a Webserver-only DAG Versioning i.e. to improve the visibility behaviour but not execution:
The scope of this AIP to make sure that the visibility behavior of
Airflow is correct, without changing the execution behaviour which
will continue to be based on the most recent version of the DAG.
This means it overcomes the issues where you can go back to an old version of the DAG, to view the shape of the DAG few months back and you can see the correct representation instead of "always-latest".
Currently, Airflow suffers from the issue where if you add/remove a task, it gets added/removed in all the previous DagRuns in the Webserver.
However, what we have decided is that we will accomplish Remote DAG Fetcher + DAG Versioning and enable versioning of DAG on the worker side, so a user will be able to run a DAG with the previous version too.
Currently, we don't have a date but mostly planning to do it around the end of 2021.
The Airflow project has a draft feature open to support DAG versions. The answer currently is that Airflow does not support versions.
The first use case in the link describes a key limitation, log files from previous runs can only surface nodes from the current DAG.
As mentioned above, as of yet, Airflow doesn't has its own functionality of versioning workflows. However you can manage that on your own by managing DAGs on their own git repository and fetching its state into airflow reposiroty as submodules. More on that;
https://www.youtube.com/watch?v=a-4yRne3ba4&lc=UgwiIO-ECVFSZPz1hOt4AaABAg

Oozie editor in HUE doesn't show workflows I submited by oozie command

I submited a workflow and run the job by oozie command:
oozie job -oozie http://node1:11000/oozie -config job.properties -submit
oozie job -oozie http://node1:11000/oozie -start [job_id]
it worked well.
And I wanted to edit the workflow by oozie editor in HUE but couldn't find it. What shall I do to make the workflow shown in oozie editor?
CDH version: 5.9
I would not recommend o use HUE for editing workflows. As it uses set of metadata stored in DB and owned by itself it will not allow you to introduce changes into workflow submitted from the outside.
Hue is pretty nice tool for prototyping and monitoring the workflow/coordinator progress, checking the logs and having a first view in case of any issue investigation. You can import your xml and allow editing via GUI - http://blog.cloudera.com/blog/2013/03/how-to-import-a-pre-existing-oozie-workflow-into-hue/ , but this is hard to manage and maintain in production like environments. For example running multiple instances of the workflow with various parameters.

Apache oozie under Hue

I need a step-by-step documentation to set up a workflow scheduler with oozie under hue with the configuration or parameterization steps.
I have a school project: "Workflows for the Big Data" or one asks me to use oozie for the scheduling of tasks in hadoop I do not know at all. After searching on the internet (site of apache oozie, site of hue and documentation) and in some books I do not find satisfactory result.
However I defined some job in xml files but when I try to systematically oozie it kills the job. I'm new to the forum and big data

Seeing what tickets were closed between two builds

How can I find out which tickets were closed between one build and the previous stable build? I'm trying to design a new build process, so I'm not set on particular tools yet. Which ones would let me see this sort of info in a dashboard, if any? Should I try to do this from a bug tracker, or from a build pipeline such as Jenkins or Bamboo, or somewhere else?
A possible set-up is to:
include the bug-tracker issue ID in your commit messages in your SCM ("[MYPROJECT-12923] add this new option in that nice feature")
launch your build with Jenkins that retrieves the source code from your SCM. Jenkins will show you a "Recent Changes" label linking to a page where you will find the commits that took place between last build and the current one. The commit messages will include the list of issues ID included in the build.
Note: that maybe does not answer your question perfectly because those commits could be intermediate one. Depends also on how granular are the commits.
In our DEV team, all the commits have the JIRA number in them (This is enforced by a plugin called TicketIt in Stash . There are various other plugins available for different repositories) . When we run a build , all commits that are part of the build are aggregrated by teamcity and displayed on a tab called issues . This solution that I am proposing works in teamcity and bamboo . I am sure thee would be some plugin with Jenkins for the same.
A hackier way would be to get the start time of the last build (x) and the current build(y) and get all JIRA tickets that were closed during this time via JIRA API. This might not be a foolproof method if your JIRA's are not always closed before a build

Can oozie control jobs outside of Hadoop?

From documentation, it isn't very clear whether oozie can schedule and control jobs outside of Hadoop? Can someone shed some light on this? If not, is there any open source based workflow engine which can do that?
Try consider using chronos (from airbnb) advanced version of cron with a UI, built on top of mesos. airbnb.github.com/chronos/
Cheers.
I believe no. Because Oozie itself does not have a resource management policy, all it does is submitting jobs to Hadoop's job tracker at the right time. Besides, for each Oozie workflow, there will be one launcher job which is responsible for submitting the real jobs in the workflow to Hadoop. The launcher job is itself a Hadoop job. So, I think for the versions earlier than Oozie 3.2, the answer should be no.
You might consider trying azkaban by linked in. It was specifically built for hadoop. But unix commands can be specified in the job file of azkaban. So you may develop a workflow for any application(s) that can be run using command line.
I've been working on a new workflow engine called Soop. https://github.com/radixCSgeek/soop it is very lightweight and simple to setup and run using a cron-like syntax. It can run any Java POJO as well as running shell processes, so you can kick off a bash script or whatever.

Resources