Can oozie control jobs outside of Hadoop? - oozie

From documentation, it isn't very clear whether oozie can schedule and control jobs outside of Hadoop? Can someone shed some light on this? If not, is there any open source based workflow engine which can do that?

Try consider using chronos (from airbnb) advanced version of cron with a UI, built on top of mesos. airbnb.github.com/chronos/
Cheers.

I believe no. Because Oozie itself does not have a resource management policy, all it does is submitting jobs to Hadoop's job tracker at the right time. Besides, for each Oozie workflow, there will be one launcher job which is responsible for submitting the real jobs in the workflow to Hadoop. The launcher job is itself a Hadoop job. So, I think for the versions earlier than Oozie 3.2, the answer should be no.

You might consider trying azkaban by linked in. It was specifically built for hadoop. But unix commands can be specified in the job file of azkaban. So you may develop a workflow for any application(s) that can be run using command line.

I've been working on a new workflow engine called Soop. https://github.com/radixCSgeek/soop it is very lightweight and simple to setup and run using a cron-like syntax. It can run any Java POJO as well as running shell processes, so you can kick off a bash script or whatever.

Related

Can you specify the number of threads for certain tasks in a DAG?

I'm very new to Airflow and while I have read the docs and some answers about Airflow's configuration regarding parallelism, it seems I have not yet found the answer to specifying threads used in a task.
My current case is I have 5 tasks (in the form of a Python script) that only do API calls (but to different API service) and transform the data. For each task I can make up to 1000+ calls, so I try to utilize multithreading in the script. Unfortunately, when I try to run the multithreaded script in Airflow, it doesn't use the multithreading mechanism in the script. I feel like this is because of Airflow configuration that overrides the child script or am I wrong? Any help or answer is appreciated, thank you.
Run your script with a KubernetesPodOperator.
You can use a python base image and run your script as is. This should closely mimic how you are executing the script locally except now it's done in a kubernetes pod.

Editing Oozie Workflow after submitting through Command Line

After running an oozie workflow using the command line I am unable to edit it using the Hue Workflow editor or even find it in the list of workflows.
I have an R script that generates the workflow.xml and job.properties, and will run the commands necessary to run the scripts:
workflow.path <- "workflow.xml" # Hard coded for the sake of this example
system2("hadoop", args = c("fs -put -f ", workflow.path, "/User/service/Test/" ))
system("oozie job --oozie http://localhost:11000/oozie -config job.properties -run")
Moving the workflow into HDFS works fine, and I have verified it is a valid workflow using oozie. Running the job also works like a charm, however if I open up Hue, and navigate to the Workflow and find it, I cannot edit it, only rerun it.
Some background on what I am trying to do: We have a large amount of automated workflows and we are always adding more. They all follow the same pattern as well so automating the creation of the coordinator and workflows is simple. Sometimes these workflows have to modified by people and they need to be able use the web interface.
Any help would be appreciated.
Indeed, only workflows created via the Drag&Drop Editor can be edited.
Workflows submitted via the CLI can only be visualized.

Unix scripting for servces to check

I am struggling to write a script to check a particular service running on my server and then send me mails.
So should this script be the part of bash profile so that its always running..
regards
rick
The .profile, .bashrc and friends will be run on login, so they are of no good use for background monitoring. Two solutions come to mind:
Either use cron to run your script at predefined intervals
Or make it loop and use your system's init environment (SysV, Upstart, SystemD, ...) to control it
My recommendation is to stick with cron - it even makes the mailing of results dead easy - just create output.

How to trigger an oozie workflow job with Bamboo?

I am new to bamboo. I know in general how to trigger an oozie workflow job in CDH env. Could someone please suggest some good documentation which describes this?
In Bamboo I have just created a plan which does the code build pointing to my repository each time I check in. Now I need to know - how can I trigger a workflow job from bamboo?
I understand that this should be some kind of command which needs to trigger from bamboo to execute. Please, suggest
You could use oozie command line interface from script execution step.
Install oozie client on the machine with Bamboo (or on any other and ssh it). On CDH cluster machines it should be preinstalled.
Create bamboo script task (Details: Bamboo Script)
Run start job command in script task, e.g. oozie job -oozie http://localhost:8080/oozie -start 14-20090525161321-oozie-joe. See details on cli usage here

Quartz.Net (2.2.3) Scheduling New Jobs

I am running the Quartz.Net server as a Windows service, like described in the documentation. I am trying to understand how I can create new jobs for Quartz to schedule, without the need to rebuild the Quaretz.net server application everytime.
I would like to be able to add new jobs from an exe, dll, or other options welcome. This way I can add jobs dynamically. From what I can tell it seems all jobs must be defined up front and built into the server. From there the user can pass parameters and enable triggers via XML file. I am using MS SQL Server instead of XML file for persistence layer.
My use case is I need to generate reports at particular times, but the users can create new reports after launch of my application. I am using Dev Express for my reporting (not sure if this matters).
Any guidance is very appreciated.
You should check out the work Tolis Bekiaris did on the eXpand Framework's JobScheduler. It's a module for DevExpress's XAF and Quartz.NET which should give you plenty of sample code, especially if you are already using XPO for your data.
You can get the source code here.
Or alternatively, it's on Github.
You'll find the job scheduler code in eXpand/Xpand/Xpand.ExpressApp.Modules/JobScheduler.

Resources