I read this on the official Oozie site, under "Actions Are Asynchronous":
All computation/processing tasks triggered by an action node are executed asynchronously by Oozie. For most types of computation/processing tasks triggered by workflow action, the workflow job has to wait until the computation/processing task completes before transitioning to the following node in the workflow.
Whereas a different page on the same site, about the Fs (HDFS) action, introduces the FS action as a synchronous action:
The FS commands are executed synchronously from within the FS action, the workflow job will wait until the specified file commands are completed before continuing to the next action.
Why are the synchronous and asynchronous descriptions basically the same? According to my understanding from an operating systems principles course, asynchronous means the function does not wait but continues execution.
Excerpt from "Apache Oozie" by Mohammad Kamrul Islam and Aravind Srinivasan:
Asynchronous Actions: All Hadoop actions and the <shell> action follow the "Action Execution Model". These are called asynchronous actions because they are launched via a launcher as Hadoop jobs.
Synchronous Actions: The filesystem action, email action, SSH action, and sub-workflow action are executed by the Oozie server itself and are called synchronous actions. The execution of these synchronous actions does not require running any user code—just access to some libraries.
Essentially, in both cases the Oozie server waits for the completion of the action and only then moves to the next action in the DAG. The separation is mainly based on whether the action executes on the Oozie server itself or out on the Hadoop cluster.
Here is a list of Oozie actions and their execution model:
Asynchronous (launched as Hadoop jobs via a launcher): map-reduce, pig, hive, sqoop, distcp, java, shell
Synchronous (executed by the Oozie server itself): fs (filesystem), email, ssh, sub-workflow
Symfony 2.8
Using https://github.com/j-guyon/CommandSchedulerBundle to manage periodic Command executions.
Each of these Command executions invokes a specific Service based on the Command arguments.
From within the Services (all of them implementing the same Interface and extending an Abstract class), the plan is to create and execute sub-processes (asynchronously if possible).
Based on your experience, what would be the best way to deal with those sub-processes?
Create a Process object (based on a Controller Action) for each sub-process, and run them synchronously (https://symfony.com/doc/2.8/components/process.html)
Use some kind of queue bundle to deal with all of them (processes or messages or whatever), such as https://php-enqueue.github.io/symfony or https://github.com/armetiz/LeezyPheanstalkBundle (any other suggestions?)
Cheers!
When using Google Cloud Tasks, how can I prematurely run a task that is in the queue? I need to run the task before it's scheduled to run. For example, the user chooses to navigate away from the page and they are prompted; if they accept the prompt to move away from that page, I need to clear the queued task item programmatically.
I will be running this with a Firebase function on the backend.
Looking at the API for Cloud Tasks found here, it seems we have primitives to:
list - get a list of tasks that are queued to run
delete - delete a task that is queued to run
run - forces a task to run now
Based on these primitives, we seem to have all the "bits" necessary to achieve your ask.
For example:
To run a task now that is scheduled to run in the future:
List all the tasks
Find the task that you want to run now
Force that task to run immediately with run (no need to delete and recreate it)
To clear a queued task instead (the case where the user accepts the prompt), find it the same way and remove it with delete.
We appear to have a REST API as well as client libraries for the popular languages.
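If it helps, here is a rough sketch of those steps using the Python client library (google-cloud-tasks). The project, location, queue, and task-matching rule are assumptions for illustration:

```python
# Rough sketch using the google-cloud-tasks Python client.
# Project, location, queue, and the matching rule are assumptions.
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")

# 1. List all the tasks in the queue.
for task in client.list_tasks(parent=parent):
    # 2. Find the task you care about (matching rule is hypothetical).
    if task.name.endswith("/tasks/my-task-id"):
        # 3. Force it to run now...
        client.run_task(name=task.name)
        # ...or, for the "user cancelled" case, remove it instead:
        # client.delete_task(name=task.name)
        break
```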
I have a task that listens for certain events and kicks off other functions.
This function (the listener) subscribes to a Kafka topic and runs forever, or at least until it gets a 'stop' event.
Wrapping this as an Airflow operator doesn't seem to work properly.
Meaning, if I send the stop event, it will not process it, or anything else for that matter.
Is it possible to run busy-loop functions in Airflow?
No, do not run infinite loops in an Airflow task.
Airflow is designed as a batch processor: long-running or infinite tasks run counter to its entire scheduling and processing model, and while it might "work", it will lock up a task runner slot.
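If the work really is event-driven, the batch-friendly alternative is to consume for a bounded window per task run and then exit. A minimal sketch, assuming the kafka-python package (the topic, broker, and handler below are hypothetical):

```python
# Bounded Kafka poll per task run, instead of an infinite loop.
from kafka import KafkaConsumer

def handle_event(message):
    # Hypothetical handler: kick off downstream work here.
    print(message.value)

def consume_batch():
    consumer = KafkaConsumer(
        "events",                         # hypothetical topic
        bootstrap_servers="broker:9092",  # hypothetical broker
        consumer_timeout_ms=60_000,       # stop iterating after 60s idle
    )
    for message in consumer:
        handle_event(message)
    consumer.close()

if __name__ == "__main__":
    consume_batch()
```

Each scheduled run drains whatever arrived since the last run and then finishes, which fits Airflow's batch model and frees the runner slot.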
I am writing a custom EL function which will be used in Oozie workflows.
This custom function is just plain Java code; it doesn't contain any Hadoop code.
My question is: where will this EL function be executed while the workflow is running?
Will Oozie execute my EL function on the Oozie server itself, or will it push my custom Java code to one of the data nodes and execute it there?
Oozie is a workflow scheduler system for managing jobs in a Hadoop cluster. It is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, streaming map-reduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). Source
This means that if you submit a job in Oozie, it will run on any of the available DataNodes; even if your Oozie service is configured on a DataNode, the job can run there as well.
To check which node is processing the job, look it up in the JobTracker (Hadoop 1) or YARN (Hadoop 2), which will redirect you to the TaskTracker node where the job is being processed.
According to Apache Oozie: The Workflow Scheduler for Hadoop, page 177:
It is highly recommended that the new EL function be simple, fast and robust. This is critical because Oozie executes the EL functions on the Oozie server.
So it will be executed on the Oozie server itself.
We are currently using Apache Mesos with Marathon and Chronos to schedule long-running and batch processes.
It would be great if we could create more complex workflows, as with Oozie: for example, kicking off a job when a file appears in a location, or when a certain application completes or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems to be no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
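As a rough illustration, registering a dependent job is a single POST to that endpoint. The sketch below uses Python's requests package; the host, job names, and command are made up:

```python
# Register a Chronos job that runs after its parent job succeeds.
import requests

dependent_job = {
    "name": "transform-data",               # hypothetical job name
    "command": "/usr/local/bin/transform.sh",
    "parents": ["ingest-data"],             # must have run at least once
    "owner": "ops@example.com",
    "epsilon": "PT30M",
}

resp = requests.post(
    "http://chronos-host:4400/scheduler/dependency",  # hypothetical host
    json=dependent_job,
)
resp.raise_for_status()
```

Chaining jobs this way, with each child listing its parents, is how you build up an arbitrarily complex DAG.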
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.