MPI How to balance workload in an unknown length problem - mpi

I have an MPI program that traverses a graph to solve a problem.
If a rank finds another branch of the graph, it sends that task to another, randomly chosen rank. Each rank waits for and receives another task after it completes one.
I have 4 processors. When I watch CPU usage while the program runs, I usually see 2-3 processors at maximum load and 1-2 processors idle, because the tasks are not split equally among the ranks.
To solve this, I need to know which ranks are not busy with a task, so that when a rank finds another branch in the graph it can see which rank is free and send the task there.
Q: How can I balance the workload between the ranks?
Note: I don't know the length or size of the graph, so I can't split the tasks into ranges for each rank at startup. I have to visit each node on the fly and check whether it solves the graph problem. If not, I send that node's branches to other ranks.
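One way to picture the "send work only to a free rank" idea is a shared work pool: every newly found branch goes into one pool, and whichever worker is idle pulls the next task. The sketch below is only an illustration of that pattern using Python threads and `queue.Queue` in place of MPI ranks; it is not MPI code. The names `traverse` and `children_of` and the example tree are hypothetical, and the sketch assumes the graph is a tree (no node is reachable twice).

```python
import queue
import threading


def traverse(root, children_of, num_workers=4):
    """Visit every node reachable from root; idle workers pull the next task."""
    tasks = queue.Queue()
    visited = []
    lock = threading.Lock()
    tasks.put(root)

    def worker():
        while True:
            node = tasks.get()            # blocks while this worker is idle
            if node is None:              # sentinel: no more work will arrive
                tasks.task_done()
                return
            for child in children_of(node):
                tasks.put(child)          # new branches go to the shared pool
            with lock:
                visited.append(node)
            tasks.task_done()             # marked done only after children are queued

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    tasks.join()                          # returns once every queued task is processed
    for _ in threads:
        tasks.put(None)                   # tell each worker to stop
    for t in threads:
        t.join()
    return visited


# Hypothetical example: a complete binary tree with nodes 0..14.
def binary_tree(n):
    return [c for c in (2 * n + 1, 2 * n + 2) if c < 15]


print(len(traverse(0, binary_tree)))
```

In an MPI setting the pool would presumably live on one rank that hands tasks to ranks reporting themselves idle; that mapping is an assumption here, not something the question states.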

Related

What does the minimum number of nodes in an AzureML compute cluster imply?

When defining an AzureML compute cluster in the AzureML Studio there is a setting that relates to the minimum number of nodes:
Azure Machine Learning Compute can be reused across runs. The compute
can be shared with other users in the workspace and is retained
between runs, automatically scaling nodes up or down based on the
number of runs submitted, and the max_nodes set on your cluster. The
min_nodes setting controls the minimum nodes available.
(From here.)
I do not understand what min_nodes actually is. Is it the number of nodes that the cluster will keep allocated even when idle (i.e. something one might want in order to speed up start-up time)?
I found a better explanation, under a tooltip in the AzureML Studio:
To avoid charges when no jobs are running, set the minimum nodes to 0.
This setting allows Azure Machine Learning to de-allocate the compute
nodes when idle. Any higher value will result in charges for the
number of nodes allocated.
So it is the minimum number of nodes allocated, even when the cluster is idle.

MPI_Scatter: order of scatter

In my work, I noticed that even if I scatter the same amount of data to each process, it takes more time to transfer data from the root to the highest-rank process. I tested this on a distributed-memory machine. If an MWE is needed I will prepare one, but first I would like to know whether MPI_Scatter gives priority to lower-rank processes.
The MPI standard does not say any such thing, so MPI libraries are free to implement MPI_Scatter() however they want with regard to which task might return earlier than others.
Open MPI, for example, can do either a linear or a binomial scatter (by default, the algorithm is chosen based on communicator and message sizes).
That being said, all data has to be sent from the root process to the other nodes, so some nodes will inevitably be served first. If the root process has rank zero, I would expect the highest-rank process to receive its data last (I am not aware of any MPI library implementing a topology-aware MPI_Scatter(), but that might come some day). If the root process does not have rank zero, then MPI might internally renumber the ranks (so that root is always virtual rank zero), and if this pattern is implemented, the last process to receive the data would be (root + size - 1) % size.
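The renumbering described above can be modelled in a few lines. This is only a model of the pattern the answer describes (a linear scatter that walks virtual ranks in order), not the actual implementation of any MPI library; `linear_scatter_order` is a hypothetical helper name.

```python
def linear_scatter_order(root, size):
    """Real ranks in the order a root-relative linear scatter would serve them."""
    # Virtual rank v corresponds to real rank (root + v) % size,
    # so virtual rank 0 is always the root itself.
    return [(root + v) % size for v in range(size)]


order = linear_scatter_order(root=2, size=5)
print(order)      # [2, 3, 4, 0, 1]
print(order[-1])  # 1, which equals (2 + 5 - 1) % 5
```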
If this is suboptimal from your application's point of view, you always have the option to re-implement MPI_Scatter() your own way (which can call the library-provided PMPI_Scatter() if needed). Another approach would be to call MPI_Comm_split() (with a single color) in order to renumber the ranks, and use the new communicator for MPI_Scatter().

Intel MPI distributed memory: building a wall out of M*N blocks using q<M processors

Imagine I have M independent jobs, and each job has N steps. Jobs are independent of each other, but the steps of each job must run serially. In other words, J(i,j) should be started only after J(i,j-1) is finished (i indicates the job index and j indicates the step). This is isomorphic to building a wall with a width of M and a height of N blocks.
Each block of work should be executed only once. The time that it takes to do one block of work using one CPU (also the same order) is different for different blocks and is not known in advance.
The simple way of doing this using MPI is to assign blocks of work to processors and wait until all of them finish their blocks before the next assignment. This way we can ensure that priorities are enforced, but there will be a lot of waiting time.
Is there a more efficient way of doing this? I mean that when a processor finishes its job, it could use some kind of environment variable or shared memory to decide which block of work to do next, without waiting for the other processors to finish their jobs and make a collective decision through communication.
You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.
If W is close to M, the best you can do is simply assign them 1:1. If one worker finishes early that's fine.
If W is much smaller than M, and N is also fairly large, here is an idea:
Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current job is allowed to ensure forward progress.
Have the workers communicate to each other which steps they completed.
Repeat, dividing the jobs evenly again taking into account how complete each one is (e.g. two 50% complete jobs count the same as one 0% complete job).
The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.
It's up to you to find a reasonable K. Maybe try 10.
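The rebalancing step ("dividing the jobs evenly again taking into account how complete each one is") could be sketched as a greedy assignment by remaining steps. This is one possible interpretation, not something the answer specifies: `assign_jobs` is a hypothetical helper, and the longest-job-first greedy rule is an assumption chosen for simplicity.

```python
import heapq


def assign_jobs(remaining_steps, num_workers):
    """Split jobs among workers so the remaining step counts are roughly equal.

    remaining_steps maps a job id to the number of steps it still needs.
    Returns one list of job ids per worker.
    """
    heap = [(0, w) for w in range(num_workers)]   # (assigned load, worker index)
    assignment = [[] for _ in range(num_workers)]
    # Place the longest remaining jobs first, each on the least-loaded worker.
    for job, steps in sorted(remaining_steps.items(), key=lambda kv: -kv[1]):
        if steps == 0:
            continue                               # finished jobs need no worker
        load, w = heapq.heappop(heap)
        assignment[w].append(job)
        heapq.heappush(heap, (load + steps, w))
    return assignment


# One untouched 10-step job weighs the same as two half-finished ones.
print(assign_jobs({"a": 10, "b": 5, "c": 5, "d": 0}, num_workers=2))
# [['a'], ['b', 'c']]
```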
Here's an idea, IDK if it's good:
Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.
When a processor finishes a step of a job, check the progress of the step it's currently working on against n. If n < current_job_step - 4, switch tasks because the one we're working on is too far ahead of the farthest-behind one.
I picked 4 to give a balance between too much switching vs. having too much serial work in only a couple tasks. Adjust as necessary, and maybe make it adaptive as you near the end.
Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.
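The switching rule above is small enough to write down directly. The sketch below covers only the decision itself, assuming `progress` is a list holding the number of completed steps per job; the names `should_switch` and `max_lead` are hypothetical, and the locking or scheduler thread needed to hand out work safely is left out.

```python
def should_switch(progress, current_job, max_lead=4):
    """True if the current job is too far ahead of the farthest-behind job."""
    n = min(progress)                      # the shared variable n from the text
    return n < progress[current_job] - max_lead


progress = [0, 3, 6]                       # completed steps of three jobs
print(should_switch(progress, current_job=2))  # True: 0 < 6 - 4
print(should_switch(progress, current_job=1))  # False: 0 is not below 3 - 4
```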

m/m/1 Queue Examples

I am having a hard time working on an M/M/1 queue (common queue architecture). I understand that
(lambda)^2/(mu*(mu-lambda)) = the average number of customers waiting in line
The part I am struggling with is that my queue is limited to only 3 clients waiting; anyone arriving after that gets dropped. So how do I find the average number of customers waiting in line now?
Logically, limiting the queue makes certain queue states (i.e. > n) impossible. Thus the probabilities of being in the states <= n must sum to 1.0.
Doing a simple Google search for "mm1 with limited queue size" the first result is a PDF that answers your question. The paper actually gives usable formulas.
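As a rough illustration of the normalisation argument above, the average number waiting can be computed numerically from the standard birth-death result that the probability of n customers in the system is proportional to (lambda/mu)^n for the allowed states. The function name `mm1k_queue_length` is hypothetical, and the sketch assumes that "3 clients waiting" does not count the customer being served, so the system holds at most 4.

```python
def mm1k_queue_length(lam, mu, max_waiting):
    """Average number waiting in an M/M/1 queue that drops arrivals when full."""
    capacity = max_waiting + 1                 # one in service plus the waiting room
    rho = lam / mu
    # Unnormalised probability of n customers in the system, n = 0..capacity.
    weights = [rho ** n for n in range(capacity + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]       # allowed states now sum to 1.0
    # With n customers in the system, n - 1 are waiting (none when n is 0).
    return sum(max(n - 1, 0) * p for n, p in enumerate(probs))


print(mm1k_queue_length(lam=2.0, mu=3.0, max_waiting=3))
```

With a very large `max_waiting` and lambda < mu, the result approaches the unbounded formula (lambda)^2/(mu*(mu-lambda)), which is a quick sanity check.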

How many mappers/reducers should be set when configuring Hadoop cluster?

When configuring a Hadoop cluster, what's the scientific method for setting the number of mappers/reducers for the cluster?
There is no formula. It depends on how many cores and how much memory you have. In general, the number of mappers plus the number of reducers should not exceed the number of cores. Keep in mind that the machine is also running the Task Tracker and Data Node daemons. One general suggestion is to have more mappers than reducers. If I were you, I would run one of my typical jobs with a reasonable amount of data to try it out.
Quoting from "Hadoop: The Definitive Guide, 3rd edition", page 306
Because MapReduce jobs are normally
I/O-bound, it makes sense to have more tasks than processors to get better
utilization.
The amount of oversubscription depends on the CPU utilization of jobs
you run, but a good rule of thumb is to have a factor of between one and two more
tasks (counting both map and reduce tasks) than processors.
A processor in the quote above is equivalent to one logical core.
But this is just theory. Most likely each use case is different from the next, so some tests need to be performed; still, this number can be a good starting point for testing.
You should probably also look at reducer lazy loading, which allows reducers to start later, when required, so the number of map slots can effectively be increased. I don't know much about this, but it seems useful.
Taken from Hadoop Gyan-My blog:
No. of mappers is decided in accordance with the data locality principle as described earlier. Data locality principle: Hadoop tries its best to run map tasks on nodes where the data is present locally, to optimize on network and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to a given map task available on a single node. Since HDFS only guarantees that data of size equal to its block size (64M) is present on one node, it is advised/advocated to have the split size equal to the HDFS block size so that the map task can take advantage of this data localization. Therefore, 64M of data per mapper. If we see some mappers running for a very short period of time, try to bring down the number of mappers and make them run longer, for a minute or so.
No. of reducers should be slightly less than the number of reduce slots in the cluster (the concept of slots comes in with a pre-configuration in the job/task tracker properties while configuring the cluster) so that all the reducers finish in one wave and make full utilisation of the cluster resources.
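The two rules of thumb above amount to simple arithmetic, sketched below. This is only an illustration: `estimate_task_counts` is a hypothetical helper, and reading "slightly less" as "one fewer than the reduce slots" is an assumption, not a figure from the quoted text.

```python
def estimate_task_counts(input_bytes, reduce_slots, block_bytes=64 * 1024**2):
    """Rough mapper/reducer counts from the rules of thumb quoted above."""
    # One mapper per split, with the split size equal to the HDFS block size.
    mappers = -(-input_bytes // block_bytes)       # ceiling division
    # Assumption: "slightly less" than the reduce slots means one fewer.
    reducers = max(1, reduce_slots - 1)
    return mappers, reducers


# 1 GiB of input with 64M blocks and 10 reduce slots in the cluster.
print(estimate_task_counts(1024**3, reduce_slots=10))  # (16, 9)
```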
