I am writing a custom EL function that will be used in Oozie workflows.
This custom function is plain Java code; it doesn't contain any Hadoop code.
My question is: where will this EL function be executed while the workflow is running?
Will Oozie execute my EL function on the Oozie node itself, or will it push my custom Java code to one of the data nodes and execute it there?
Oozie is a workflow scheduler system to manage jobs in a Hadoop cluster. It is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). (Source)
This means that if you submit a job through Oozie, it will run on any of the available DataNodes; if your Oozie service happens to be configured on a DataNode, it can run there as well.
To check which node a job is running on, look it up in the JobTracker (Hadoop 1) or in YARN (Hadoop 2), which will redirect you to the TaskTracker node where the job is being processed.
According to Apache Oozie: The Workflow Scheduler for Hadoop (page 177), which states:
It is highly recommended that the new EL function be simple, fast and
robust. This is critical because Oozie executes the EL functions on
the Oozie server.
So it will be executed on your Oozie server itself, not on a data node.
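For reference, a custom EL function is just a public static Java method that you register with the Oozie server and then reference from the workflow XML. Below is a minimal sketch; the class name, method name, EL prefix and the oozie-site.xml property shown in the comments are illustrative assumptions, not something taken from your setup.
package com.example.oozie;

// Plain Java, no Hadoop code. Registered (for example) in oozie-site.xml via
//   oozie.service.ELService.ext.functions.workflow =
//       myfn:toUpper=com.example.oozie.CustomElFunctions#toUpper
// and then used in a workflow definition as ${myfn:toUpper('hello')}.
public class CustomElFunctions {

    // EL functions must be public static methods; Oozie calls this while it
    // resolves the expression on the Oozie server, not on a data node.
    public static String toUpper(String value) {
        return value == null ? null : value.toUpperCase();
    }
}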
Related
I would like to automate some of my tasks using Apache Kafka. Previously I did the same using Apache Airflow, which worked fine, but I want to explore whether Kafka works better for this than Airflow.
Kafka runs on Server A.
Kafka watches for a file named test.xml on Server B, checking every 10 or 20 minutes whether the file has been created.
Once Kafka detects that the file has been created, the job starts as follows:
a) Create a Jira ticket and update all the executions on Jira for each event
b) Trigger an rsync command
c) Unarchive the files using the tar command
d) Execute some script using the unarchived files
e) Archive the files and rsync them to a different location
f) Send an email once all tasks are finished
Please advise whether Kafka is a sensible tool to begin with for this, or let me know if there are other open-source products that can handle these actions. By the way, I prefer to set this up as a docker-compose based installation.
Or please suggest the best open-source tools available for this automation purpose.
Thanks
Avoid using Kafka for the use cases you mentioned. Kafka is not a good fit for defining DAGs or workflows; it works great for streaming use cases (data in motion).
Airflow would allow you to define DAGs with multiple tasks.
You can leverage the filesystem sensor (https://github.com/apache/airflow/blob/main/airflow/sensors/filesystem.py) to check whether a file/folder has been created or updated.
After that, you can leverage Python operators and other operators (hooks) to achieve all the other tasks as well.
I need to initialise my Corda nodes by running a few flows to create certain states.
At the moment I am doing it via the CRaSH shell.
e.g.
flow start IOUFlow iouValue: 50, counterparty: Bank1
Is it possible to have the node run a script or some commands on node startup to do this automatically?
If not, how can I write a bash script to automate these CRaSH commands?
Corda 4.4 introduces a new feature to register actions to be performed on node startup.
You could register an action to be performed on node startup using a CordaService.
appServiceHub.register(
AppServiceHub.SERVICE_PRIORITY_NORMAL,
event -> {
// Your custom code to be run on startup.
}
);
You might want to check the event type to keep it future-proof, but currently ServiceLifecycleEvent has just a single STATE_MACHINE_STARTED value.
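For reference, here is a minimal sketch of a CordaService that wraps the registration above and starts the flow from the question once the node is up. The package paths, the IOUFlow constructor and the Bank1 X.500 name are assumptions based on the question, so adjust them to your CorDapp:
import net.corda.core.identity.CordaX500Name;
import net.corda.core.identity.Party;
import net.corda.core.node.AppServiceHub;
import net.corda.core.node.services.CordaService;
import net.corda.core.node.services.ServiceLifecycleEvent;
import net.corda.core.serialization.SingletonSerializeAsToken;

@CordaService
public class InitialisationService extends SingletonSerializeAsToken {

    public InitialisationService(AppServiceHub appServiceHub) {
        appServiceHub.register(AppServiceHub.SERVICE_PRIORITY_NORMAL, event -> {
            // Guard on the event type so nothing runs before the state machine is ready.
            if (event == ServiceLifecycleEvent.STATE_MACHINE_STARTED) {
                // Hypothetical counterparty name; resolve it to a Party first.
                Party bank1 = appServiceHub.getIdentityService()
                        .wellKnownPartyFromX500Name(CordaX500Name.parse("O=Bank1,L=London,C=GB"));
                // Equivalent of: flow start IOUFlow iouValue: 50, counterparty: Bank1
                appServiceHub.startFlow(new IOUFlow(50, bank1));
            }
        });
    }
}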
When an Oozie launcher spawns another Hadoop job, is there any way to get the application ID, or even better the ResourceManager link, for that spawned application? It seems like the Oozie launcher is only aware of its own ID.
This is with a Spark action.
You can use the following built-in Oozie EL function to get the application ID:
wf:actionExternalId(String node)
More details on available EL functions here: http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a4.2_Expression_Language_Functions
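For example, assuming the Spark action in your workflow is named spark-node (a hypothetical name), a later action could reference its external application ID as:
${wf:actionExternalId('spark-node')}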
I just deployed an Oozie job. Now when I go to the Oozie web UI ... I just cannot see the job I deployed.
Is there a command-line tool which will allow me to do two things:
List all the jobs which are deployed (not running, active, or killed)... but deployed, like an inventory of all jobs.
Execute a job from the command line (on demand, not based on a schedule).
As already mentioned in one of the comments, oozie itself is a command-line tool.
Therefore, to answer both of your questions:
List all jobs
For listing all workflow jobs, use either of the following commands:
oozie jobs
oozie jobs -jobtype wf
For listing all coordinator jobs, use the following command from the console:
oozie jobs -jobtype coordinator
Execute a job from the command line
oozie job --oozie http://oozie-url:11000/oozie -config job.properties -run
Mind you, if you want to keep starting jobs on demand, you either have to run the above command manually (and only for workflow jobs, not coordinator jobs, since coordinator jobs are scheduled according to the schedule you define) or put it in a shell script so that it is triggered under certain conditions.
For more info check this link - Oozie_Command_Line_Usage
We are currently using Apache Mesos with Marathon and Chronos to schedule long-running and batch processes.
It would be great if we could create more complex workflows, like with Oozie. Say, for example, kicking off a job when a file appears in a location, or when a certain application completes or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems to be no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.
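As a rough illustration, a dependent job can be registered by POSTing JSON to that endpoint. The sketch below (in Java, to match the other snippets here) assumes a Chronos host of chronos.example.com:4400 and made-up job names, script path and owner; it registers process-file so that it only runs after its parent wait-for-file has run at least once:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChronosDependentJob {
    public static void main(String[] args) throws Exception {
        // Dependent jobs use a "parents" list instead of a "schedule".
        String jobJson = "{"
                + "\"name\": \"process-file\","
                + "\"command\": \"/opt/scripts/process.sh\","
                + "\"owner\": \"ops@example.com\","
                + "\"parents\": [\"wait-for-file\"]"
                + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://chronos.example.com:4400/scheduler/dependency"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jobJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Chronos responded: " + response.statusCode());
    }
}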