Design of JSR352 batch job: Is several steps a better design than one large batchlet?

My JSR352 batch job needs to read from a database and then, depending on the result, flow to one of two pathways, each of which involves some more if/else scenarios. I wonder what the pros and cons are of writing a single step with a large batchlet versus several steps consisting of smaller batchlets. This job does not involve chunk steps with a chunk size larger than 1, as it needs to persist the read result (if there is one) immediately before proceeding to other logic. The job will be run using Control-M, and I wonder if using multiple smaller steps provides more control points.

From that description, I'd suggest these
Benefits of more, fine-grained steps
1. Restart
After a job failure, the default behavior on restart is to begin executing at the step where the previous job execution failed. So breaking the job up into more steps allows you to avoid writing the logic to resume where you left off and avoid re-processing, and may save execution time in the process.
2. Reuse
By encapsulating a discrete function as its own batchlet, you can potentially compose other steps in other jobs (or even later in this job) implemented with this same batchlet.
3. Extract logic into XML
By moving the transition logic into the transition elements, and extracting the conditional flow (e.g. <next on="RC1" to="step3"/>, etc.) into the job definition XML (JSL), you can introduce changes at a standard control point, without having to go into the Java source and find the right place.
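To make the restart and transition points above concrete, here is a toy model in Python. It is not the JSR352 API: run_job, the steps dict and the transitions table are hypothetical names, and the completed dict merely stands in for whatever the batch runtime persists between executions. The transitions table plays the role of the <next on="..." to="..."/> elements.

```python
def run_job(steps, transitions, start, completed):
    """Run steps starting at `start`, following (step, exit status) transitions.

    steps:       step name -> callable returning an exit status string
    transitions: (step name, exit status) -> next step name
    completed:   step name -> exit status of steps that already finished;
                 stands in for state a real batch runtime would persist
    """
    current = start
    while current is not None:
        if current in completed:
            # Already done in an earlier execution: reuse its status, skip the work.
            status = completed[current]
        else:
            status = steps[current]()   # may raise; nothing is recorded on failure
            completed[current] = status
        # The routing lives in data, not in the step code.
        current = transitions.get((current, status))
    return completed
```

If a step raises, calling run_job again with the same completed dict skips the steps that already finished and resumes at the one that failed. With one large batchlet, that bookkeeping would have to be written by hand inside the batchlet.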
Final Thoughts
You'll have to decide if those benefits are worth it for your case.
One more thought
I wouldn't automatically rule out the chunk step just because you are using a 1-item chunk, if you can still find benefits from the checkpointing or even possibly the skip/retry. (But that's probably a separate question.)


How to configure Apache Airflow with Celery to run concurrent tasks?

I am interested in this use case for my proof of concept: I read from a file containing a huge list of ids, and I want to process these ids concurrently by calling func(id) on each.
Is it possible to configure airflow with CeleryExecutors to achieve this?
I saw this link :-
Running more than 32 concurrent tasks in Apache Airflow
But what if the number of ids is unknown, anywhere from 10,000 to 100,000, and I want to process them around 500-1000 at a time?
Airflow can execute tasks in parallel, and it can use Celery to achieve this. Everything else is up to you to implement however you see fit, there are no specifics related to Airflow/Celery regarding your intended use.
In the end, if all you care about is parallelizing your work and you don't care much about other Airflow features, you could be better off using Celery alone.
There are many different ways to go about this, but here is some food for thought to get you started:
Airflow tasks should be as "dumb" as possible, i.e. take an input, process it and store the output. Don't put your file-splitting logic here. You can have a dedicated DAG for that if needed. For example, you can have a DAG which reads the input file and chunks it up via some logic, then store it somewhere for tasks to pick up (convenient file structure, message queue, db, etc.)
Decide on a place for your input data such that tasks can easily pick up a limited amount of input. For example, if you're using a file structure where one chunk to be processed is a single file, a task can read a single file and remove it. Repeat until no chunks/files are left. The same goes for any other approach, e.g. if using a message queue you can consume the chunks. Make sure you have that original DAG ready to split up the input file into chunks again if needed. You are free to make this as simple or as complex as you want.
Watch out for idempotency, i.e. make sure your process can be repeated without side-effects. That way, if you lose data in some step, you can just restart everything without issues.
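The chunk-file idea above can be sketched in plain Python. This is only an illustration under assumptions: the function names, the chunk_NNNNN.txt naming and the ".claimed" suffix are made up here, and nothing in it is Airflow- or Celery-specific. In a real setup, split_into_chunks would be the body of the splitting task and process_all the body of each worker task.

```python
import os


def split_into_chunks(ids, chunk_size, out_dir):
    """Write the ids to numbered chunk files, at most chunk_size ids per file."""
    paths = []
    for n, start in enumerate(range(0, len(ids), chunk_size)):
        path = os.path.join(out_dir, f"chunk_{n:05d}.txt")
        with open(path, "w") as fh:
            fh.write("\n".join(str(i) for i in ids[start:start + chunk_size]))
        paths.append(path)
    return paths


def claim_next_chunk(out_dir):
    """Claim one unprocessed chunk by renaming it; return None when none are left."""
    for name in sorted(os.listdir(out_dir)):
        if not name.endswith(".txt"):
            continue  # already claimed by some worker
        src = os.path.join(out_dir, name)
        dst = src + ".claimed"
        try:
            os.rename(src, dst)  # atomic on POSIX filesystems
        except FileNotFoundError:
            continue  # another worker claimed it first
        return dst
    return None


def process_all(out_dir, func):
    """Worker loop: claim a chunk, apply func to every id in it, remove the chunk."""
    results = []
    while (path := claim_next_chunk(out_dir)) is not None:
        with open(path) as fh:
            results.extend(func(int(line)) for line in fh if line.strip())
        os.remove(path)
    return results
```

Several workers can run process_all against the same directory, since the rename decides which worker owns a chunk. If a worker dies mid-chunk, the leftover ".claimed" file shows what has to be redone, which is where the idempotency of func matters.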

How does Redis RDB persistence actually work behind the scenes?

I was going through Redis RDB persistence, and I have some doubts about one of its disadvantages.
My understanding so far:
We should use RDB persistence when we need to save a snapshot of the dataset currently in memory at some regular interval.
I can understand that this way we can lose some data if the server breaks down. But another disadvantage that I can't understand is how fork can be time consuming when persisting a large dataset using RDB.
Quoting from Documentation
RDB needs to fork() often in order to persist on disk using a child
process. Fork() can be time consuming if the dataset is big, and may
result in Redis to stop serving clients for some millisecond or even
for one second if the dataset is very big and the CPU performance not
great. AOF also needs to fork() but you can tune how often you want to
rewrite your logs without any trade-off on durability.
As far as I know, when a parent process forks, it creates a new child process. The child can then run different code based on the value fork() returns, or we can give it a new executable to run using the exec() system call.
But what I don't understand is why this becomes a heavy task when the dataset is larger.
I think I know the answer, but I'm not sure about it.
Quoted from this link https://www.bottomupcs.com/fork_and_exec.xhtml
When a process calls fork then
the operating system will create a new process that is exactly the same as the parent process. This means all the state that was talked about previously is copied, including open files, register state and all memory allocations, which includes the program code.
As per the above statement, the whole Redis dataset will be copied to the child.
Am I understanding this right?
When a standard fork is called with copy-on-write, the OS must still copy all the page table entries, which can take time if you have small 4k pages and a huge dataset. This is what makes the fork() call itself slow.
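As a rough back-of-the-envelope check of that page-table cost, the sketch below assumes 4 KiB pages and 8-byte page table entries (the x86-64 layout) and counts only the lowest level of the page table, ignoring the upper levels. The 24 GiB figure is just an example, not something from the question.

```python
PAGE_SIZE = 4096   # bytes per page, assuming 4 KiB pages
PTE_SIZE = 8       # bytes per page table entry, assuming x86-64


def page_table_bytes(dataset_bytes):
    """Approximate size of the leaf page table entries mapping the dataset."""
    pages = -(-dataset_bytes // PAGE_SIZE)  # ceiling division
    return pages * PTE_SIZE


# A 24 GiB dataset is about 6.3 million pages, so roughly 48 MiB of
# page table entries have to be copied by fork() before it returns.
print(page_table_bytes(24 * 2**30) // 2**20, "MiB")
```

So the data itself is not copied at fork time, but the bookkeeping that maps it is, and that grows linearly with the dataset.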
You can also find that a lot of time and memory is required if your dataset is changing a lot in a sparse way, as copy-on-write semantics trigger the actual memory pages to be copied as changes are made to the original. Redis also performs incremental rehashing, maintains expiry, etc., so an instance that is more active will typically take longer to save to disk.
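The snapshot behaviour that copy-on-write gives the child can be seen with a small Python sketch. This is not how Redis is implemented; it only shows, on a POSIX system, that a forked child keeps seeing memory as it was at fork time while the parent carries on changing it. snapshot_via_fork and mutate_after_fork are names invented for this example.

```python
import os


def snapshot_via_fork(data, mutate_after_fork):
    """Fork a child that serializes `data` as it looked at fork time.

    The parent runs mutate_after_fork(data) right after forking; the child's
    view is unaffected because the pages it reads are copied on write.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: write the snapshot to the pipe and exit without cleanup.
        try:
            os.close(read_fd)
            payload = ",".join(f"{k}={v}" for k, v in sorted(data.items()))
            os.write(write_fd, payload.encode())
            os.close(write_fd)
        finally:
            os._exit(0)
    # Parent: keep "serving writes" while the child saves.
    os.close(write_fd)
    mutate_after_fork(data)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return b"".join(chunks).decode()
```

Calling snapshot_via_fork({"counter": 1}, lambda d: d.update(counter=999)) returns the pre-fork state from the child even though the parent has already changed the value. Every page the parent touches during the save has to be physically duplicated, which is the extra time and memory described above.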
More reading:
Faster forking of large processes on Linux?
http://kirkwylie.blogspot.co.uk/2008/11/linux-fork-performance-redux-large.html

How are the chances of getting "read-your-writes" consistency increased in Dynamo?

In Section 5 of Dynamo paper, there is the following content:
In particular, since each write usually follows a read operation, the
coordinator for a write is chosen to be the node that replied fastest to the
previous read operation which is stored in the context information of the
request. This optimization enables us to pick the node that has the data that
was read by the preceding read operation thereby increasing the chances of
getting "read-your-writes" consistency.
How are the chances of getting "read-your-writes" consistency increased?
"read-your-writes" means that a read following a write gets the value set by the
write. The read and the write are performed by two different clients for this
context. The reason is that the choice of the write coordinator does not impact
on the chances of getting "read-your-writes" by the same client.
But the above text is talking about a write following a read. Here is my guess.
The read coordinator will try to do syntactic reconciliation if it is possible.
If syntactic reconciliation is impossible because of divergent versions, the
client needs to do semantic reconciliation before doing a write. Either way, the
versions on all the nodes involved in the read operation are ancestors of the
reconciled version. So the following write can be sent to any of them to get
applied. The earliest time for a write to be seen by a read is after the
following steps are finished:
The client contacts the write coordinator.
The write coordinator generates the vector clock for the new version.
The write coordinator writes the new version locally.
The shorter the time to perform the above steps, the more likely it is that a
following read sees the new version. The node which replied fastest to the
previous read can very possibly perform these steps in a shorter time, so
such a node is chosen as the write coordinator.
Section 2.3 talks about performing the reconciliation at read time rather than write time.
Data versioning - "One can determine whether two versions of an
object are on parallel branches or have a causal ordering, by
examine their vector clocks."
This paragraph from section 4. [emphasis mine]
In Dynamo, when a client wishes to update an object, it must specify
which version it is updating. This is done by passing the context it
obtained from an earlier read operation, which contains the vector
clock information. Upon processing a read request, if Dynamo has
access to multiple branches that cannot be syntactically reconciled,
it will return all the objects at the leaves, with the corresponding
version information in the context. An update using this context is
considered to have reconciled the divergent versions and the
branches are collapsed into a single new version
So by performing the read first, you're effectively reconciling all divergent versions prior to writing. By writing to that same node, the version you've updated is marked with the context and vector clock of the most up-to-date version, and all divergent branches can be collapsed. This is sent to the top N nodes (as you've stated) as fast as possible. But by removing the divergent branches, you reduce the chance that multiple values could be returned. You only need one of the N nodes read in the next read to get the reconciled write, i.e. the node as part of the quorum of R reads says "I am the reconciled version, and all others must bow to me" (and if that has already been distributed to another of the "R" nodes, then there's an even greater chance of getting the reconciled version in the quorum).
But if you wrote to a different node, one that you hadn't read from, the vector clock that is being updated may not necessarily be a reconciled version of the object. Therefore, you could still have divergent branches. The following read will try to reconcile it, but it's more probable that you could have multiple divergent versions and no reconciliation.
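The vector clock bookkeeping behind this can be sketched in a few lines of Python. The representation (a dict from node name to counter) and the function names are assumptions made for illustration; the node names Sx, Sy, Sz follow the style of the paper's versioning example.

```python
def descends(a, b):
    """True if clock `a` has seen everything `b` has (a >= b for every node in b)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify two versions by their vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"  # parallel branches: needs semantic reconciliation


def reconcile_and_write(read_clocks, coordinator):
    """Merge every clock returned by the read, then bump the coordinator's counter."""
    merged = {}
    for clock in read_clocks:
        for node, count in clock.items():
            merged[node] = max(merged.get(node, 0), count)
    merged[coordinator] = merged.get(coordinator, 0) + 1
    return merged
```

For example, {"Sx": 2, "Sy": 1} and {"Sx": 2, "Sz": 1} compare as concurrent. A write that carries both clocks in its context and is coordinated by Sx produces {"Sx": 3, "Sy": 1, "Sz": 1}, which descends from both branches and so collapses them.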
If you've made it this far, I think the most interesting part is that per Section 6, client applications can dictate the values of N, R and W - ie - number of nodes that constitute the pool to draw from, and the number of nodes that must agree on a read or write for it to be successful.
Geez - my head hurts now.
I re-read the Dynamo paper and have a new understanding of "read-your-writes" consistency. "read-your-writes" involves only one client. Imagine the following requests performed by one client on the same key:
read-1
write-1
read-2
"read-your-writes" means that read-2 sees write-1. The write coordinator has the best chance of having write-1. To ensure "read-your-writes", it is desirable that the write coordinator is the node that replies fastest to read-2. It is highly likely that the node that replied fastest to read-1 will also reply fastest to read-2. So the node that replied fastest to read-1 is chosen as the write coordinator.
And what is the node that replied fastest to the previous read operation? Such a node only makes sense if client-driven coordination is used. For server-side coordination, the coordinator node replies to the client and the other involved nodes reply to the coordinator node, so "replied fastest" is meaningless in this case.

How to Implement embarrassingly parallel task (FOR loop) WITHOUT MPI-IO?

Preamble:
I have a very large one-dimensional array and need to solve an evolution equation (a wave-like equation). I need to calculate an integral at each value of this array, store the resulting array of integrals, apply the integration again to this new array, and so on (in simple words, I apply an integral on a grid of values, store this new grid, apply the integration again, and so on).
I used MPI-IO to spread the work over all nodes: there is a shared .dat file on my disk, each MPI process reads this file (as a source for integration), performs the integration and writes back to this shared file. This procedure repeats again and again. It works fine. The most time-consuming part was the integration, and the file reading and writing was negligible.
Current problem:
Now I have moved to a 1024-CPU (16x64) HPC cluster and I'm facing the opposite problem: the calculation time is NEGLIGIBLE compared to the read-write process!!!
I tried to reduce the number of MPI processes: I use only 16 MPI processes (to spread over the nodes) plus 64 threads with OpenMP to parallelize the computation inside each node.
Again, reading and writing is now the most time-consuming part.
Question
How should I modify my program, in order to utilize the full power of 1024 CPUs with minimal loss?
The important point, is that I cannot move to the next step without completing the entire 1D array.
My thoughts:
Instead of reading and writing, I can ask my rank=0 (master rank) to send and receive the entire array to and from all nodes (MPI_Bcast). So, instead of every node doing I/O, only one node will do it.
Thanks in advance!!!
I would look here and here. FORTRAN code for the second site is here and C code is here.
The idea is that you don't give the entire array to each processor. You give each processor only the piece it works on, with some overlap between processors so they can handle their mutual boundaries.
Also, you are right to save your computation to disk every so often. And I like MPI-IO for that. I think it is the way to go. But the codes in the links will allow you to run without reading every time. And, for my money, writing out the data every single time is overkill.
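The overlap idea can be shown without MPI at all. The sketch below is plain Python and makes one big assumption that the question does not state: that the update at each point only needs its immediate neighbours (a 3-point average is used here as a stand-in for the real integration step). The function names are invented for the example.

```python
def smooth(values):
    """Stand-in update: 3-point average on interior points, endpoints unchanged."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out


def split_with_halo(values, parts, halo=1):
    """Give each 'rank' its own slice plus `halo` overlap cells on each side."""
    n = len(values)
    pieces = []
    for p in range(parts):
        start, stop = p * n // parts, (p + 1) * n // parts
        lo, hi = max(0, start - halo), min(n, stop + halo)
        pieces.append((start, stop, lo, values[lo:hi]))
    return pieces


def step_decomposed(values, parts):
    """One time step where every piece is updated independently."""
    result = []
    for start, stop, lo, local in split_with_halo(values, parts):
        updated = smooth(local)
        # Keep only the cells this piece owns; the halo cells are discarded.
        result.extend(updated[start - lo:stop - lo])
    return result
```

step_decomposed(values, 3) gives the same result as smooth(values) on the whole array, even though no piece ever saw the full data. In an MPI program each rank would keep its piece in memory across steps and only swap the halo cells with its two neighbours each step (for instance with MPI_Sendrecv), so no shared file is needed between steps.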

Hadoop suitability for recursive data processing

I have a filtering algorithm that needs to be applied recursively and I am not sure if MapReduce is suitable for this job. Without giving too much away, I can say that each object that is being filtered is characterized by a collection of ordered lists or queues.
The data is not huge, just about 250MB when I export from SQL to
CSV.
The mapping step is simple: the head of the list contains an object that can classify the list as belonging to one of N mapping nodes. The filtration algorithm at each node works on the collection of lists assigned to that node, and at the end of the filtration, either a list remains the same as before or the head of the list is removed.
The reduce function is simple too: all the map jobs' lists are brought together and may have to be written back to disk.
When all the N nodes have returned their output, the mapping step is repeated with this new set of data.
Note: N can be as much as 2000 nodes.
Simple, but it may require up to 1000 recursions before the algorithm's termination conditions are met.
My question is would this job be suitable for Hadoop? If not, what are my options?
The main strength of Hadoop is its ability to transparently distribute work on a large number of machines. In order to fully benefit from Hadoop your application has to be characterized, at least by the following three things:
work with large amounts of data (data which is distributed in the cluster of machines) - which would be impossible to store on one machine
be data-parallelizable (i.e. chunks of the original data can be manipulated independently from other chunks)
the problem which the application is trying to solve lends itself nicely to the MapReduce (scatter - gather) model.
It seems that out of these 3, your application has only the last 2 characteristics (with the observation that you are trying to recursively use a scatter - gather procedure - which means a large number of jobs - equal to the recursion depth; see last paragraph why this might not be appropriate for hadoop).
Given the amount of data you're trying to process, I don't see any reason why you wouldn't do it on a single machine, completely in memory. If you think you can benefit from processing that small amount of data in parallel, I would recommend focusing on multicore processing rather than on distributed data-intensive processing. Of course, using the processing power of a networked cluster is tempting, but this comes at a cost: mainly the time inefficiency caused by network communication (the network being the most contended resource in a Hadoop cluster) and by I/O. In scenarios which are well-fitted to the Hadoop framework, these inefficiencies can be ignored because of the efficiency gained by distributing the data and the associated work on that data.
As I can see, you need 1000 jobs. The setup and the cleanup of all those jobs would be an unnecessary overhead for your scenario. Also, the overhead of network transfer is not necessary, in my opinion.
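To show how little machinery the single-machine version needs, here is a sketch of the group-by-head, filter, repeat loop in Python. The actual filtration rule was not given in the question, so filter_group below is a pure placeholder (the first list in a group keeps its head, the others lose theirs); only the surrounding loop is the point.

```python
from collections import defaultdict


def filter_group(group):
    """Placeholder rule, NOT the asker's algorithm: the first list (in sorted
    order) keeps its head, every other list in the group loses its head."""
    keep, *rest = sorted(group)
    return [keep] + [lst[1:] for lst in rest]


def run_until_stable(lists, max_rounds=1000):
    """Repeat 'group by head, filter each group' until nothing changes."""
    current = [list(lst) for lst in lists]
    for rounds in range(max_rounds):
        groups = defaultdict(list)
        nxt = []
        for lst in current:
            if lst:
                groups[lst[0]].append(lst)   # the "map" step: route by head
            else:
                nxt.append(lst)              # empty lists have nothing to filter
        for group in groups.values():
            nxt.extend(filter_group(group))  # the per-node filtration
        if sorted(nxt) == sorted(current):   # termination: a round changed nothing
            return nxt, rounds
        current = nxt
    return current, max_rounds
```

Each round here is a dictionary build and a loop over in-memory lists, with no job setup, cleanup or network transfer, so a thousand rounds over about 250 MB of data stays a single-process task. If one round turns out to be CPU-bound, the per-group calls are independent and could be spread over cores.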
Recursive algos are hard in the distributed systems since they can lead to a quick starvation. Any middleware that would work for that needs to support distributed continuations, i.e. the ability to make a "recursive" call without holding the resources (like threads) of the calling side.
GridGain is one product that natively supports distributed continuations.
The litmus test for distributed continuations: try to develop a naive Fibonacci implementation in a distributed context using recursive calls. Here's GridGain's example that implements this using continuations.
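The shape of that Fibonacci test can be sketched in a single Python process. This is not GridGain's API or its example; it is a toy work queue, with invented names, that shows the one property that matters: a task that spawns sub-tasks records what it is waiting for and returns immediately, instead of holding a thread until its children finish.

```python
from collections import deque


def fib_continuations(n):
    """Naive fib(n) in which no task ever blocks waiting on its children."""
    queue = deque([(n, None)])   # (argument, id of the parent task)
    waiting = {}                 # task id -> [children outstanding, sum so far, parent id]
    next_id = 0
    answer = None
    while queue:
        arg, parent = queue.popleft()
        if arg >= 2:
            # Spawn two sub-tasks, record the continuation, release the worker.
            task_id, next_id = next_id, next_id + 1
            waiting[task_id] = [2, 0, parent]
            queue.append((arg - 1, task_id))
            queue.append((arg - 2, task_id))
            continue
        # Base case: deliver the value upward through finished continuations.
        value = arg
        while parent is not None:
            slot = waiting[parent]
            slot[0] -= 1
            slot[1] += value
            if slot[0] > 0:
                break                       # parent still waits for its other child
            del waiting[parent]
            value, parent = slot[1], slot[2]
        else:
            answer = value                  # reached the root task
    return answer
```

With blocking recursive calls, every level of the recursion would pin a worker while it waits, which is the quick starvation described above. Here the only thing held per waiting task is a small record in the waiting table.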
Hope it helps.
Q&D, but I suggest you read a comparison of MongoDB and Hadoop:
http://www.osintegrators.com/whitepapers/MongoHadoopWP/index.html
Without knowing more, it's hard to tell. You might want to try both. Post your results if you do!
