JMH setup and tear down

I've created a class. Within that class I have several methods that are marked with @Benchmark. I also have a main method that runs the JMH benchmarks:
System.out.println("NUMBER OF THREADS: "+numOfThreads);
Options opt = new OptionsBuilder()
        .include(JMHtopToBottom.class.getSimpleName())
        .warmupIterations(5)
        .measurementIterations(3)
        .forks(numOfThreads)
        .build();
Collection<RunResult> collection = new Runner(opt).run();
My interest is to have:
A setup method that runs only once, right after new Runner(opt).run() is invoked and before any of the @Benchmark methods are called (along with their iterations).
As well, a tear-down method that runs only once, right after all the benchmark methods have run and before we return to main.
When I tried @Setup and @TearDown (with Level support: Trial/Iteration/Invocation) the methods ran several times, not only once as I wished. Is there a way in JMH to annotate methods so they run just once: right after run() starts and right before run() returns?

You are missing a few things:
Forks are not threads; they are separate processes launched to run each benchmark. I.e., if you set forks to 5, any benchmark (in the selected benchmark set) will be run 5 times, each time in a separate VM.
Unless forks=0 (not recommended, as benchmark isolation is gone, compilation profiles get mixed, etc.; it is meant mostly for debugging), all benchmarks are run in separate processes. So each Trial-level setup/teardown for a given benchmark will run once per forked JVM. There is no shared 'Suite' context.
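For reference, this is roughly what per-fork (Trial-level) setup and teardown look like; a minimal sketch that reuses the JMHtopToBottom class name from the question, with an illustrative benchmark body:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@State(Scope.Benchmark)
public class JMHtopToBottom {

    @Setup(Level.Trial)
    public void setUp() {
        // Runs once per forked JVM, before any warmup or measurement iteration in that fork.
        System.out.println("trial setup");
    }

    @TearDown(Level.Trial)
    public void tearDown() {
        // Runs once per forked JVM, after every iteration in that fork has finished.
        System.out.println("trial teardown");
    }

    @Benchmark
    public long measureSomething() {
        return System.nanoTime(); // placeholder benchmark body
    }
}

With forks(numOfThreads) in the options above, this pair of messages appears once per forked VM per benchmark, which is exactly the repeated-setup behaviour described in the question.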
If you absolutely require some suite-level context, you'll have to construct it outside the benchmark VMs (e.g. some file that is read in the benchmark setup and updated in the teardown, etc.).
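If all you need is something that runs once before any benchmark starts and once after everything finishes, the simplest place for it is your own main method, around the Runner call. A minimal sketch, with hypothetical suiteSetup/suiteTearDown helpers:

import java.util.Collection;

import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkMain {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JMHtopToBottom.class.getSimpleName())
                .warmupIterations(5)
                .measurementIterations(3)
                .forks(5)                                // five forked JVMs per benchmark, as discussed above
                .build();

        suiteSetup();                                    // runs exactly once, before any fork is launched
        Collection<RunResult> results = new Runner(opt).run();
        suiteTearDown(results);                          // runs exactly once, after all forks have finished
    }

    private static void suiteSetup() {
        // e.g. prepare input files or start an external service the benchmarks will use
    }

    private static void suiteTearDown(Collection<RunResult> results) {
        // e.g. clean up and post-process the collected results
    }
}

Note that this code runs in the harness JVM, not in the forked benchmark JVMs, so the benchmarks cannot see any in-memory state created here; anything they must share has to go through something external, such as the file mentioned above.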

Related

Design of JSR352 batch job: Is several steps a better design than one large batchlet?

My JSR352 batch job needs to read from a database and then, depending on the result, flow down one of two pathways, each of which involves some more if/else scenarios. I wonder what the pros and cons are between writing a single step with a large batchlet versus several steps consisting of smaller batchlets. This job does not involve chunk steps with a chunk size larger than 1, as it needs to persist the read result immediately, if there is any, before proceeding to other logic. The job will be run using Control-M; I wonder if using multiple smaller steps provides more control points.
From that description, I'd suggest the following.
Benefits of more, fine-grained steps
1. Restart
After a job failure, the default behavior on restart is to begin executing at the step where the previous job execution failed. So breaking the job up into more steps allows you to avoid writing the logic to resume where you left off and avoid re-processing, and may save execution time in the process.
2. Reuse
By encapsulating a discrete function as its own batchlet, you can potentially compose other steps in other jobs (or even later in this job) implemented with this same batchlet.
3. Extract logic into XML
By moving the transition logic into the transition elements, and extracting the conditional flow (e.g. <next on="RC1" to="step3"/>, etc.) into the job definition XML (JSL), you can introduce changes at a standard control point without having to go into the Java source and find the right place (a sketch follows this list).
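To illustrate that last point with a hedged sketch: a batchlet's process() return value becomes the step's exit status, which the JSL transition elements can branch on. The class name, status values, and lookup method here are made up:

import javax.batch.api.AbstractBatchlet;
import javax.inject.Named;

@Named
public class CheckResultBatchlet extends AbstractBatchlet {

    @Override
    public String process() throws Exception {
        // Hypothetical database read; persist the result immediately, then report what was found.
        boolean recordFound = lookUpAndPersistRecord();

        // The returned string becomes the exit status that <next on="RC1" to="step3"/> (etc.)
        // in the job XML can match on.
        return recordFound ? "RC1" : "RC0";
    }

    private boolean lookUpAndPersistRecord() {
        return true; // placeholder for the real database access
    }
}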
Final Thoughts
You'll have to decide if those benefits are worth it for your case.
One more thought
I wouldn't automatically rule out the chunk step just because you are using a 1-item chunk, if you can still find benefits from the checkpointing or even possibly the skip/retry. (But that's probably a separate question.)

Query in MPI initialization

If we call MPI_Init(), we know that multiple copies of the same executable run on different machines. Suppose MPI_Init() is in a function f(); will multiple copies of the main() function exist too?
The main problem that I am facing is taking inputs. In effect, what is happening is that input is being taken once, but the main function is running several times. The process with rank 0 always seems to have the input; the rest of them have random values. So to distribute the values, do we have to broadcast the input from the process with rank 0 to all the other processes?
MPI_Init() doesn't create multiple copies; it just initializes the MPI library inside the process. Multiple copies of your program are created before that, most probably by some kind of mpirun command (that is how you launch your MPI application). All processes are independent from the beginning, so, answering the first part of your question: yes, multiple copies of main() will exist, and they will exist even if you don't call MPI_Init.
The answer to your question about inputs depends on the nature of the inputs: if they are typed in from the console, then you have to read the values in only one process (e.g. rank 0) and then broadcast them. If the inputs are in some file or specified as command-line arguments, then all processes can access them.
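For the console case, the usual pattern is: only rank 0 reads, then the value is broadcast to every rank. A minimal sketch, assuming the Java bindings that ship with Open MPI (in C the equivalent call is MPI_Bcast with the same arguments):

import java.util.Scanner;

import mpi.MPI;
import mpi.MPIException;

public class ReadAndBroadcast {

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();

        int[] input = new int[1];
        if (rank == 0) {
            // Only rank 0 talks to the console; the other ranks never see this prompt.
            input[0] = new Scanner(System.in).nextInt();
        }

        // After the broadcast, every rank holds the same value in input[0].
        MPI.COMM_WORLD.bcast(input, 1, MPI.INT, 0);

        System.out.println("rank " + rank + " has input " + input[0]);
        MPI.Finalize();
    }
}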

MPI and global variables

I have to implement an MPI program. There are some global variables (4 arrays of floats and 6 other single float variables) which are first initialized by the main process reading data from a file. Then I call MPI_Init and, while the process with rank 0 waits for results, the other processes (ranks 1, 2, 3, 4) work on the arrays, etc.
The problem is that those arrays no longer seem to be initialized; everything is set to 0. I tried to move the global variables inside the main function but the result is the same. When MPI_Init() is called, all processes are created by fork, right? So each one has a copy of the parent's memory; why do they see uninitialized arrays?
I fear you have misunderstood.
It is probably best to think of each MPI process as an independent program, albeit one with the same source code as every other process in the computation. Operations that process 0 carries out on variables in its address space have no impact on the contents of the address spaces of other processes.
I'm not sure that the MPI standard even requires process 0 to have values for variables which were declared and initialised prior to the call to mpi_init, that is before process 0 really exists.
Whether it does or not, you will have to write code to get the values into the variables in the address space of the other processes. One way to do this would be to have process 0 send the values to the other processes, either one by one or using a broadcast. Another way would be for all processes to read the values from the input files; if you choose this option, watch out for contention over I/O resources.
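As a hedged sketch of the broadcast option (shown with Open MPI's Java bindings for consistency with the previous example; in C it is an MPI_Bcast of each float array), with a placeholder array standing in for the globals described in the question:

import mpi.MPI;
import mpi.MPIException;

public class ShareGlobals {

    static float[] coefficients = new float[1024]; // stands in for one of the global float arrays

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();

        if (rank == 0) {
            loadFromFile(coefficients); // only rank 0 reads the input file
        }

        // Copy rank 0's values into the address space of every other process.
        MPI.COMM_WORLD.bcast(coefficients, coefficients.length, MPI.FLOAT, 0);

        // From here on, every rank sees the initialized array.
        MPI.Finalize();
    }

    private static void loadFromFile(float[] dst) {
        java.util.Arrays.fill(dst, 1.0f); // placeholder for the real file read
    }
}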
In passing, I don't think it is common for MPI implementations to create processes by forking at the call to mpi_init. Most MPI implementations actually create the processes when you launch the program with mpiexec; the call to mpi_init is the formality which announces that your program is starting its parallel computations.
"When MPI_Init() is called all processes are created by fork right?"
Wrong.
MPI spawns multiple instances of your program. These instances are separate processes, each with its own memory space. Each process has its own copy of every variable, including globals. MPI_Init() only initializes the MPI environment so that other MPI functions can be called.
As the other answers say, that's not how MPI works. Data is unique to each process and must be explicitly transferred between processes using the API available in the MPI specification.
However, there are programming models that allow this sort of behavior. If, when you say parallel computing, you mean multiple cores on one processor, you might be better served by using something like OpenMP to share your data between threads.
Alternatively, if you do in fact need to use multiple processors (either because your data is too big to fit in one processor's memory, or for some other reason), you can take a look at one of the Partitioned Global Address Space (PGAS) languages. In those models, you have memory that is globally available to all processes in an execution.
Last, there is a part of MPI that does allow you to expose memory from one process to other processes. It's the Remote Memory Access (RMA) or One-Sided chapter. It can be complex, but powerful if that's the kind of computing model you need.
All of these models will require changing the way your application works, but it sounds like they might map to your problem better.

How to control the number of threads when executing an Asynchronous Activity in WF 4

I am creating a workflow in WF 4, where I have a ParallelForEach activity that iterates over a collection of items. For each item in the collection, I execute a custom asynchronous activity to process multiple items in parallel.
The above solution works for me, but I am concerned about the number of threads used, since each asynchronous activity instance is executed on its own thread. Is there a way to configure/control the number of threads that get launched when executing the ParallelForEach activity in the mechanism described above?
"since each asynchronous activity instance is executed on its own thread." Who says? Certainly not the docs.
ParallelForEach enumerates its values and schedules the Body for every value it enumerates on. It only schedules the Body. How the body executes depends on whether the Body goes idle.
If the Body does not go idle, it executes in reverse order, because the scheduled activities are handled as a stack: the last scheduled activity executes first.
For example, if you have a collection of {1, 2, 3, 4} in ParallelForEach and use a WriteLine as the Body to write the value out, you get 4, 3, 2, 1 printed to the console. This is because WriteLine does not go idle, so after the 4 WriteLine activities are scheduled, they execute with stack behavior (last in, first out).
Parallelism of execution occurs only when an activity creates a bookmark and goes idle. Even then, two activities aren't actually executing at the same time: one or more have just stopped executing, allowing others to run in order. Understandably confusing, given the name, but that's it.
In any event, when you're relying on the framework to parallelize for you, don't worry about how many threads they're using. They probably have everything under control. Until you know they don't.
Will is correct; ParallelForEach does not require a new thread for each branch. If you are doing blocking I/O, that code should occur in an AsyncCodeActivity so that you aren't unnecessarily blocking. If you want CPU-bound work to run in parallel with other activities, you will need either to wrap it in an AsyncCodeActivity or to use InvokeMethod with RunAsynchronously = true, in which case the framework will take care of running the work on a background thread.
The SynchronizationContext extensibility point is intended for cases where you have a particular existing threading model that you need WF to integrate with. Prime examples of this include ASP.NET's threading environment and Windows Presentation Foundation/WinForms (e.g. if you wanted an activity to interact correctly with the UI thread).

Hadoop suitability for recursive data processing

I have a filtering algorithm that needs to be applied recursively, and I am not sure if MapReduce is suitable for this job. Without giving too much away, I can say that each object that is being filtered is characterized by a collection of ordered lists or queues.
The data is not huge, just about 250 MB when I export from SQL to CSV.
The mapping step is simple: the head of the list contains an object that can classify the list as belonging to one of N mapping nodes. The filtration algorithm at each node works on the collection of lists assigned to that node, and at the end of the filtration either a list remains the same as before or the head of the list is removed.
The reduce function is simple too: all the map jobs' lists are brought together and may have to be written back to disk.
When all the N nodes have returned their output, the mapping step is repeated with this new set of data.
Note: N can be as many as 2000 nodes.
Simple, but it may require up to 1000 recursions before the algorithm's termination conditions are met.
My question is: would this job be suitable for Hadoop? If not, what are my options?
The main strength of Hadoop is its ability to transparently distribute work across a large number of machines. In order to fully benefit from Hadoop, your application has to be characterized by at least the following three things:
work with large amounts of data (data which is distributed across the cluster of machines), which would be impossible to store on one machine
be data-parallelizable (i.e. chunks of the original data can be manipulated independently from other chunks)
the problem which the application is trying to solve lends itself nicely to the MapReduce (scatter-gather) model.
It seems that out of these three, your application has only the last two characteristics (with the observation that you are trying to use a scatter-gather procedure recursively, which means a large number of jobs, equal to the recursion depth; see the last paragraph for why this might not be appropriate for Hadoop).
Given the amount of data you're trying to process, I don't see any reason why you wouldn't do it on a single machine, completely in memory. If you think you can benefit from processing that small amount of data in parallel, I would recommend focusing on multicore processing rather than on distributed, data-intensive processing. Of course, using the processing power of a networked cluster is tempting, but this comes at a cost: mainly the time inefficiency caused by network communication (the network being the most contended resource in a Hadoop cluster) and by the I/O. In scenarios which are well suited to the Hadoop framework, these inefficiencies can be ignored because of the efficiency gained by distributing the data and the associated work on that data.
As far as I can see, you would need 1000 jobs. The setup and cleanup of all those jobs would be unnecessary overhead for your scenario. Also, the overhead of network transfer is not necessary, in my opinion.
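To give a flavour of the single-machine alternative suggested above, here is a small, hedged sketch of iterative filtering over in-memory lists using Java parallel streams; the element type, the head-rejection rule, and the termination check are all stand-ins for the real algorithm:

import java.util.List;
import java.util.stream.Collectors;

public class InMemoryFilter {

    // One pass: drop the head of every queue that the (hypothetical) filter rejects.
    static List<List<Integer>> filterOnce(List<List<Integer>> queues) {
        return queues.parallelStream()                        // uses all cores on one machine
                .map(q -> headIsRejected(q) ? q.subList(1, q.size()) : q)
                .filter(q -> !q.isEmpty())                    // fully consumed queues fall away
                .collect(Collectors.toList());
    }

    static boolean headIsRejected(List<Integer> q) {
        return q.get(0) % 2 == 0;   // placeholder for the real classification/filtering rule
    }

    public static void main(String[] args) {
        List<List<Integer>> current = List.of(List.of(2, 3, 5), List.of(7, 11), List.of(4, 6));
        List<List<Integer>> previous;
        int passes = 0;
        do {                                                  // iterate until nothing changes (or a cap is hit)
            previous = current;
            current = filterOnce(current);
            passes++;
        } while (!current.equals(previous) && passes < 1000);
        System.out.println("finished after " + passes + " passes: " + current);
    }
}

Each pass uses all available cores via the common fork-join pool, with no job setup, cleanup, or network transfer between passes.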
Recursive algorithms are hard in distributed systems since they can quickly lead to starvation. Any middleware that would work for this needs to support distributed continuations, i.e. the ability to make a "recursive" call without holding the resources (like threads) of the calling side.
GridGain is one product that natively supports distributed continuations.
The litmus test for distributed continuations: try to develop a naive Fibonacci implementation in a distributed context using recursive calls. Here is GridGain's example that implements this using continuations.
Hope it helps.
Quick and dirty, but I suggest you read a comparison of MongoDB and Hadoop:
http://www.osintegrators.com/whitepapers/MongoHadoopWP/index.html
Without knowing more, it's hard to tell. You might want to try both. Post your results if you do!
