In Oozie, Actions have retry options; are there similar options at the Workflow level?

We have an Oozie Workflow with some Spark, Hive, and SSH actions. Occasionally the workflow fails due to some one-off issue, and the failed instance pretty much always succeeds upon re-running it. However, I couldn't find any automatic retry options at the workflow or coordinator level.
I did see that Actions have retry options, like the number of retry attempts and how long to wait before retrying. That will help as a workaround for now - but it got me wondering whether workflows really don't have any such options?
The Workflows & Coordinators are created and maintained using the Hue (3.12) editor - not directly through the XML file.

Currently, there is no such option in Oozie. However, Oozie's workflow definition language allows you to create loops using sub-workflow actions and decision nodes: route the failing action's error transition to a decision node and, as long as a retry counter has not been exhausted, re-invoke the same workflow as a sub-workflow.
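A rough sketch of that pattern in the workflow XML is shown below (it is not something the Hue editor will generate for you). The retryCount property and the limit of 3 are assumptions for illustration; you would pass retryCount=0 in from the coordinator or job properties and let each nested invocation increment it:

```xml
<workflow-app name="retryable-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="main-action"/>

    <action name="main-action">
        <!-- your Spark / Hive / SSH action definition goes here -->
        <ok to="end"/>
        <error to="check-retries"/>
    </action>

    <!-- on failure, decide whether to try again or give up -->
    <decision name="check-retries">
        <switch>
            <case to="retry">${retryCount lt 3}</case>
            <default to="fail"/>
        </switch>
    </decision>

    <!-- re-invoke this same workflow as a sub-workflow with an incremented counter -->
    <action name="retry">
        <sub-workflow>
            <app-path>${wf:appPath()}</app-path>
            <configuration>
                <property>
                    <name>retryCount</name>
                    <value>${retryCount + 1}</value>
                </property>
            </configuration>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow gave up after ${retryCount} retries</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each retry runs as a separate nested workflow instance, so keep the retry limit small and make sure the failing action is safe to re-run.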

Related

Serve a web request in Python that spawns a new long-running subprocess

I currently have a python command line application that uses python invoke package to organise, list and execute tasks. There are many task files (controlled & created by users, not me). Execution time for some task files can be more than an hour. Each task is actually a test script/program. invoke is useful in listing/executing all the tasks in a task file (we call it a testsuite) or only a bunch of them (a tasks collection) or a single task. (Having a ton of loose scripts and organising, listing & running them in the way users want would be quite a task, hence invoke).
However, invoke cannot be used as a library. It does not offer an API that can be leveraged to list and run test tasks. So I am forced to run invoke as a shell command in a subprocess from the command line program. I replace (via execl()) the current process with invoke, because once control passes to invoke there is no need to come back to the parent process. So far so good.
Now, there is a requirement that this command line program be callable from a web application. So I need to wrap this command line program in a RESTful HTTP API. I've decided to use bottle.py to keep things simple.
I understand that the long-running testsuite (tasks) will have to be run off the HTTP request/response cycle. But I'm unable to finalise exactly how to go about it (probably I am overthinking). But here is what I want:
Tasks are written by users. They are always synchronous; they may sleep or execute shell commands via subprocess.run().
The application is internal; it will not be bombarded with a huge number of requests. Max. 10 users.
But each request (of the type that runs a task) will take minutes, and in some cases more than an hour, to complete. New requests during this time should not block.
The calling application (running on a different host) will need to report the progress of the running task to the browser UI (a 'progress bar').
Ability to communicate with the running task and 'cancel' it from the browser UI.
Given the above situation, am I correct in saying:
Because a new 'process' must be spawned for each request (due to the use of subprocess and execl() in the current code), it rules out using 'threads' of any type (OS threads, greenlets, gevent)?
Using async libraries (web framework, web/HTTP server, or in-app code) won't be of much help, because every run request will have to be a new process anyway?
How will the process be spawned when a request comes in? Let the web/HTTP server (gunicorn?) do it? Or does my application have to take care of forking itself?
Is 'gunicorn' a good choice for this situation?
I have a feeling that users may also ask for the ability to schedule tasks/tests, so I might end up using some sort of task queue. I have read about 'huey' and feel that it is light & simple for my needs (no Redis/Celery). But any task queue also means a separate consumer process to administer? More moving parts in the mix.
The 'progress bar' functionality means the subprocess has to keep updating its progress somewhere and the calling application has to read it from there. Does this necessitate a 'task queue' anyway?
There is a lot of material on all of this and I have read quite some of it, but it has still left me unclear as to how exactly to go about implementing my requirements. Any direction/pointers would be appreciated. I'd also appreciate any advice on what 'not to use'.
If you need something really simple then you could write a wrapper around task spooler (a Linux tool for queueing and running tasks): https://vicerveza.homeunix.net/~viric/soft/ts/ (see especially https://vicerveza.homeunix.net/~viric/soft/ts/article_linux_com.html for more details).
Otherwise it's probably better to switch to the uWSGI spooler, RQ with Redis, or Celery with RabbitMQ (because with Redis it only works to a certain extent).
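To make the "off the request/response cycle" part concrete, here is a minimal sketch in the question's own stack (bottle.py plus the standard library) that spawns invoke as a subprocess, exposes a progress endpoint, and supports cancellation. The endpoint names, the PROGRESS_FILE convention, and the assumption that tasks write their own progress JSON are all made up for illustration; it also keeps job handles in memory, so it only works with a single web worker process:

```python
# Minimal sketch (not production code): run `invoke` as a background subprocess
# from a bottle.py app, report progress from a JSON file, and allow cancellation.
import json
import os
import signal
import subprocess
import uuid
from pathlib import Path

from bottle import Bottle

app = Bottle()
JOBS = {}                                # job_id -> subprocess.Popen (single worker only)
STATE_DIR = Path("/tmp/task-state")      # hypothetical location for progress files
STATE_DIR.mkdir(parents=True, exist_ok=True)


@app.post("/run/<collection>/<task>")
def run_task(collection, task):
    """Spawn the task and return immediately with a job id."""
    job_id = uuid.uuid4().hex
    progress_file = STATE_DIR / f"{job_id}.json"
    env = dict(os.environ, PROGRESS_FILE=str(progress_file))
    # The task (or a thin wrapper around it) is expected to write
    # {"percent": ..., "message": ...} into PROGRESS_FILE as it runs.
    proc = subprocess.Popen(["invoke", "--collection", collection, task], env=env)
    JOBS[job_id] = proc
    return {"job_id": job_id}


@app.get("/status/<job_id>")
def status(job_id):
    """Poll this from the browser to drive the progress bar."""
    proc = JOBS.get(job_id)
    if proc is None:
        return {"error": "unknown job"}
    progress_file = STATE_DIR / f"{job_id}.json"
    progress = json.loads(progress_file.read_text()) if progress_file.exists() else {}
    return {"running": proc.poll() is None, "returncode": proc.poll(), "progress": progress}


@app.post("/cancel/<job_id>")
def cancel(job_id):
    """Best-effort cancellation; the task must handle SIGTERM to clean up."""
    proc = JOBS.get(job_id)
    if proc is not None and proc.poll() is None:
        proc.send_signal(signal.SIGTERM)
    return {"cancelled": job_id}


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A queue such as RQ or huey would replace the in-memory JOBS dict with something that survives restarts and multiple workers, which is where the extra moving parts start to pay off.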

How do I configure a flow to send to multiple PODs in K8S?

I have a single Camunda job that is configured as a multi-instance call to another process. At present, multi-instance asynchronous before, multi-instance asynchronous after, and multi-instance exclusive are all checked. We have multiple PODs deployed to handle the calls (1k at a time), and right now when I try to run this it seems like, no matter what I do, it is running them serially, or close to it. What is needed to actually send all 1000 elements to multiple instances of the child process?
I tried configuring the multi-instance async settings:
Multi Instance
Loop Cardinality: blank
Collection: builtJobList
Element Variable: builtRequestObject
I then have all three multi-instance flags checked. The Asynchronous Continuations are not checked.
Camunda BPM will only run a single thread (execution) within a given process instance at a time by default. You can change that behavior for a given task/activity by checking the "Asynchronous Before" and/or "Asynchronous After" checkboxes - thus electing to use the Job Executor - and deselecting the "Exclusive" checkbox. (This also applies to the similar checkboxes for multi-instance activities.) If you do that, beware that the behavior may not be what you want; specifically:
You will likely receive OptimisticLockingExceptions if you have a decent number of threads running simultaneously on a single instance. These are thrown when the individual threads attempt to update the information in the relational database for the process instance and discover that the data has been modified while they were performing their processing.
If those OptimisticLockingExceptions occur, the Job Executor will automatically retry the Job without decrementing the available retries. This will re-execute the Job, re-executing any included integration logic as well. This may not be desirable.
Although Camunda BPM has been proven to be fantastic at executing large numbers of process instances in parallel, it isn't designed to execute multiple threads simultaneously within an individual process instance. If you want that behavior within a given process instance, I would suggest that you handle the threading yourself within an individual Service Task, fire-and-forget launching the threads you need and letting the Service Task complete within Camunda immediately after launching them... of course if that's feasible given your application's desired behavior.
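For reference, here is roughly how those multi-instance flags end up in the BPMN 2.0 XML when using the Camunda 7 extension attributes (the call activity id and called element key are made up; the collection and element variable names are taken from the question):

```xml
<bpmn:callActivity id="CallChildProcess" calledElement="childProcessKey">
  <!-- exclusive="false" lets the Job Executor pick up the per-element jobs in parallel,
       with the OptimisticLockingException caveats described above -->
  <bpmn:multiInstanceLoopCharacteristics
      camunda:collection="builtJobList"
      camunda:elementVariable="builtRequestObject"
      camunda:asyncBefore="true"
      camunda:asyncAfter="true"
      camunda:exclusive="false"/>
</bpmn:callActivity>
```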

activiti taskService complete fails when executed concurrently

Hi, I am facing a strange situation where I am trying to mark a set of tasks as complete, all concurrently.
The first one goes through, and the second one sometimes goes through (rarely), but mostly it doesn't.
When I do these individually they work.
I feel it is something to do with database locking. Is there some workaround or code for executing task and variable updates concurrently?
Do they belong to the same process instance?
And yes, there will be a db locking mechanism in place, because when you complete each task a process instance will need to move forward.
Can you please clarify what you are trying to solve? What is your business scenario?
Cheers
Activiti uses pre-emptive locking and this can cause problems for parallel tasks.
Typically if you use the "exclusive" flag the problems go away (https://www.activiti.org/userguide/#exclusiveJobs).
Keep in mind that jobs never actually run in parallel, the job engine selects jobs to run and if there are multiple they will be run sequentially (which appears to be parallel to the user).
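For illustration, this is roughly what flagging an activity's job as asynchronous and exclusive looks like in the Activiti BPMN XML (the task id and delegate class are made up for the example):

```xml
<serviceTask id="processRecord"
             name="Process record"
             activiti:class="com.example.ProcessRecordDelegate"
             activiti:async="true"
             activiti:exclusive="true"/>
```

Exclusive jobs belonging to the same process instance are not executed at the same time, which trades raw parallelism for avoiding the locking failures described above.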

How do TaskTrackers inform the JobTracker about their state?

I have been reading about Apache Hadoop. It says that in Hadoop, tasks are any process, that is, a mapper or a reducer, and together they are called jobs.
There are two things, the JobTracker and the TaskTracker. A TaskTracker runs on each node and manages the mapper or reducer tasks.
And the JobTracker is the one that manages all the TaskTrackers.
Till now I understand all the concepts theoretically, and all of this is well explained in many blogs.
But I have one doubt: how does a TaskTracker inform the JobTracker that a given task failed? How do they communicate with each other? Do they use any other software, such as Apache Avro?
Please explain the internal mechanism of this to me.
Looking for your kind reply.
AVRO has nothing to do with this. It is just a serialization framework, which folks usually use if they feel that Hadoop's serialization is not helping them much. Otherwise it is just another member of the Hadoop ecosystem.
Coming to your original question, it is done through heartbeats, as #thiru_k has specified above. But along with the number of available slots, the heartbeat signals contain some other info as well, like job status, resource usage, etc. Tasks which don't report their progress for a while are marked as hung or killed. I would suggest you go through this link; it'll answer all your questions.
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

Create workflow service instances for a large number of records at once

I'm working on a business problem which has to import files that contain 1000s of records. Each record has to be registered in a workflow as an individual record which has to go through its own workflow.
The WF4 Corporate Purchase Process example has a good solution: in the first step it creates bookmarks for all the required record ids, so the workflow can be resumed with the rest of the actions for each individual record/id.
I would like to know how to implement the same thing using Workflow Services, as I could get the benefits of AppFabric for my workflows.
Are there any other solutions for handling a batch of records/ids? Otherwise the workflow service has to be called 1000s of times just to register every record in a workflow instance, which is not a good solution.
I would like to know how to implement the same thing using Workflow Services, as I could get the benefits of AppFabric for my workflows.
This is pretty straightforward. You're going to have one workflow that reads the file and loops through the results using the looping activities that exist. Then, inside the loop, you'll be starting up the workflow that each record needs (the "Service") by calling the endpoint with a Send activity.
Now, as for the workflow that is the Service, you're going to have a Receive activity at the top of the workflow that also has CanCreateInstance set to true. Then everything after the Receive is no different from any other workflow. You may consider having a Send activity right after the Receive just to let the caller know that the Service has been started. But that's not a requirement -- the Receive will be required because it forces WF to build the workflow to use the WorkflowServiceHost.
Are there any other solutions for handling a batch of records/ids? Otherwise the workflow service has to be called 1000s of times just to register every record in a workflow instance, which is not a good solution.
Are you indicating that for a web server to receive 1000's of requests is not a good solution? Consider the fact that an IIS server can handle roughly 25-50 requests, per instant in time, per core. Now consider the fact that your loop that's loading the workflows isn't going to average more than maybe 5 in that instant of time, but probably more like 1 or 2.
I don't think the web server is going to be your issue. I've started up literally 10,000's of workflows on a server via a loop just like the one you're going to build and it didn't break a sweat.
One way would be to use WCF's MSMQ binding to launch your workflows. Requests can come in normally through HTTP, and WCF would route them to MSMQ and process the load. You can throttle how many workflow instances are used through the MSMQ binding + IIS settings.
Download this Word document that describes setting up a workflow application with WCF and MSMQ: http://www.microsoft.com/en-us/download/details.aspx?id=21245
In the spirit of doing the simplest thing that could work, you can bring the subworkflow in as an activity in the main workflow and use a parallel for-each to execute the branch for each input from your file. No extra invoking is required, and the tooling supports this out of the box because all workflows are activities. Hosting the main process in a service, so you can avoid contention with the rest of your IIS users (real people as they may be), might be a good idea.
I do agree that calling IIS or a WCF service 1000's of times is not a problem though, unless you want to do it in a few seconds!
It is important to remember that one of the good things about workflow is that it has fairly low overhead (compared to other workflow products) so you should be more concerned about what your workflow does than just the idea of launching lots of instances. The idea of batches like your example is very common.
