Collective communications along directions in MPI cartesian topology

I have a 3D cartesian topology of nx by ny by nz processes.
There are mathematical calculations that involve at the same time only "pencils" of processors. In the case of a 3 by 3 by 3 matrix of processes, ranked from 0 to 26, process 4 is involved in three operations:
with processes 13 and 22 along the first direction
with processes 1 and 7 along the second direction
with processes 3 and 5 along the third direction
These mathematical operations require both point-to-point and collective communications between processes belonging to the same pencil.
For the point-to-point communications, I used MPI_CART_SHIFT so that each process knows the ranks of its neighbouring processes. (Then I'm going to use MPI_SENDRECV.)
As for the collective communications, how can I perform them?
I think a solution could be to define "pencil" communicators, of which there would be nx*ny + nx*nz + ny*nz (this number of required communicators is asymptotically small with respect to the total number of processes as the number of processes per direction grows).
Would this be the only way? Is there no standard subroutine relying on the cartesian communicator to perform such collective communications?

The neighborhood collectives are really the only routines that can directly exploit the connectivity information for cartesian topologies. However, they would treat all the directions (x, y, z) in the same way so wouldn't help you with your pencil scheme.
I think the only way to do this is as you suggest, i.e. construct a complete set of pencil communicators. Note that MPI gives you an easy way to do this: call MPI_Cart_sub on the cartesian communicator. Once you've constructed the pencil communicators you would also have the option of using neighborhood collectives on the pencils rather than point-to-point, but for a 1D communicator it's not clear that this has many advantages over computing neighbours by hand as you do at present.
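For concreteness, here is a plain-Python sketch (no MPI) of the grouping that MPI_Cart_sub would produce for the 3x3x3 example in the question, assuming the usual row-major rank ordering that MPI_Cart_create gives when reordering is disabled:

```python
# Sketch: which ranks share a "pencil" in a 3x3x3 cartesian topology,
# assuming row-major rank ordering (rank = x*ny*nz + y*nz + z).
nx = ny = nz = 3

def coords(rank):
    """Cartesian coordinates (x, y, z) of a rank."""
    return rank // (ny * nz), (rank // nz) % ny, rank % nz

def pencil(rank, direction):
    """Ranks in the pencil through `rank` along `direction` (0, 1 or 2)."""
    x, y, z = coords(rank)
    members = []
    for r in range(nx * ny * nz):
        c = coords(r)
        # same coordinates in every direction except the varying one
        if all(c[d] == (x, y, z)[d] for d in range(3) if d != direction):
            members.append(r)
    return members
```

This reproduces the question's example: `pencil(4, 0)` gives [4, 13, 22], `pencil(4, 1)` gives [1, 4, 7], and `pencil(4, 2)` gives [3, 4, 5]. In real MPI a single call to MPI_Cart_sub with the appropriate remain_dims argument creates each family of pencil communicators at once.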

Related

Algorithm to balance a set of differently sized matrix blocks between processes

I want to balance a set of matrix blocks between processes. The matrix blocks have different sizes, although typically one single block dominates, being of similar size or even larger than all the other blocks combined. The number of processes may be anywhere between much larger and much smaller than the number of blocks. Each block can be stored either on a single process or distributed as a ScaLAPACK array. The balancing should qualitatively fulfill the following conditions:
No process should receive many more matrix elements than target_load = sum(size(blocks[:])) / n_procs
No block should be distributed over many more processes than size(block) / target_load
MPI communicators may be split off from mpi_comm_world, but they cannot overlap (blocks 1 and 2 both being distributed over processes 0:4 is fine, but block 1 being distributed over processes 0:3 and block 2 over processes 2:5 is not; undistributed blocks may be stacked arbitrarily on top of distributed blocks)
I am aware that such a distribution will depend on how strongly and in which priority the first two conditions are applied (the third condition should apply strictly). Nonetheless, is there any algorithm that facilitates some interpretation of these conditions?
This is a pretty common problem in computer science, so an off-the-shelf library should be able to help you. You should check out METIS and/or SCOTCH to see if either is suitable for your needs.
Your first condition is 'load balance' and your second condition is something like 'communication cost' (i.e. the cost of MPI communication within divided blocks).
The proper balance between these two conditions will depend entirely on the nature of your problem, but using SCOTCH or METIS you should be able to tweak the parameters until you find the best combination.
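As a much simpler starting point than METIS/SCOTCH, one hypothetical greedy interpretation of the conditions can be sketched in a few lines (the function name, the contiguous-range policy, and the least-loaded tie-breaking are all my own assumptions, not part of any library):

```python
def distribute(block_sizes, n_procs):
    """Map each block index to a list of process ranks (greedy sketch)."""
    target = sum(block_sizes) / n_procs
    # place the largest blocks first, so the dominant block gets its range
    order = sorted(range(len(block_sizes)), key=lambda b: -block_sizes[b])
    load = [0.0] * n_procs
    assignment = {}
    start = 0
    for b in order:
        # condition 2: at most about size/target processes per block
        width = min(n_procs - start, max(1, int(block_sizes[b] / target)))
        if width > 1:
            # distributed block: fresh contiguous range, never overlapping
            procs = list(range(start, start + width))
            start += width
        else:
            # undistributed block: stack it on the least-loaded process
            procs = [min(range(n_procs), key=lambda p: load[p])]
        for p in procs:
            load[p] += block_sizes[b] / len(procs)
        assignment[b] = procs
    return assignment
```

With one dominant block, e.g. `distribute([100, 10, 10, 10, 10], 4)`, the big block gets a two-process range and the small blocks are stacked singly on the remaining processes, which respects the non-overlap condition because only undistributed blocks share processes.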

Difference between All-to-All Reduction and All-Reduce in MPI

Trying to figure out the difference between all-to-all reduction and all-reduce in Open MPI. From my understanding, all-to-one reduction takes a piece m (integer, array, etc.) from all processes and combines all the pieces together with an operator (min, max, sum, etc.), storing the result in the selected process. From this I assume that all-to-all reduction is the same, but the result is stored in all the processes instead of just one. From this document it seems like all-reduce is basically doing the same as all-to-all reduction; is this right, or am I getting it wrong?
The all-reduce (MPI_Allreduce) is a combined reduction and broadcast (MPI_Reduce, MPI_Bcast). They might have called it MPI_Reduce_Bcast. It is important to note that an MPI reduction works element-wise across processes; it does not reduce within a process. So if you have 10 numbers on each of 5 processes, after an MPI_Reduce one process has 10 numbers. After MPI_Allreduce, all 5 processes have the same 10 numbers.
In contrast, the all-to-all reduction performs a reduction and scatter, hence it is called MPI_Reduce_scatter[_block]. So if you have 10 numbers each on 5 processes, after a MPI_Reduce_scatter_block, the 5 processes have 2 numbers each. Note that MPI doesn't itself use the terminology all-to-all reduction, probably due to the misleading ambiguity.
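The two semantics can be sketched in plain Python (no MPI), with each inner list standing for one process's buffer:

```python
def allreduce_sum(buffers):
    """Every process ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

def reduce_scatter_block_sum(buffers):
    """The element-wise sum is split into equal chunks, one per process."""
    total = [sum(vals) for vals in zip(*buffers)]
    chunk = len(total) // len(buffers)
    return [total[p * chunk:(p + 1) * chunk] for p in range(len(buffers))]
```

With 10 numbers on each of 5 processes, `allreduce_sum` leaves all 5 processes holding the same 10 sums, while `reduce_scatter_block_sum` leaves each process holding 2 of them, matching the description above.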

Cellular automaton with more than 2 states (more than just alive or dead)

I am making a roguelike where the setting is open world on a procedurally generated planet. I want the distribution of each biome to be organic. There are 5 different biomes. Is there a way to organically distribute them without a huge complicated algorithm? I want the amount of space each biome takes up to be nearly equal.
I have worked with cellular automata before when I was making the terrain generators for each biome. There were 2 different states for each tile there. Is there an efficient way to do 5?
I'm using python 2.5, although specific code isn't necessary. Programming theory on it is fine.
If the question is too open ended, are there any resources out there that I could look at for this kind of problem?
You can define a cellular automaton on any cell state space. Just formulate the cell update function as F: Q^n -> Q, where Q is your state space (here Q = {0,1,2,3,4,5}) and n is the size of your neighborhood.
As a start, just write F as a majority rule: with 0 as the neutral state, F should return the value in 1-5 with the highest count in the neighborhood, and 0 if none is present. In case of a tie, you may pick one of the most frequent values at random.
As an initial state, start with a configuration of 5 relatively equidistant cells in states 1-5 (you may place them deterministically at fixed positions that can be shifted/mirrored, or generate the points randomly).
When all cells have a value different from 0, you have your map.
Feel free to improve on the update function, for example by applying the rule with a given probability.
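A minimal sketch of one update step under this majority rule, assuming a Moore (8-cell) neighbourhood on a bounded grid; cells whose neighbourhood contains no biome keep their current value:

```python
import random

def step(grid):
    """One majority-rule update: 0 is neutral, 1-5 are biomes."""
    h, w = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(h):
        for j in range(w):
            counts = {}
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == dj == 0:
                        continue
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w and grid[ni][nj] != 0:
                        counts[grid[ni][nj]] = counts.get(grid[ni][nj], 0) + 1
            if counts:
                best = max(counts.values())
                # break ties at random among the most frequent biomes
                new[i][j] = random.choice(
                    [s for s, c in counts.items() if c == best])
    return new
```

Iterating `step` from a grid seeded with 5 spaced-out biome cells grows each biome outward until no neutral cells remain; with a single seed, one step already floods its whole neighbourhood.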

Generalization of MPI rank number to MPI groups?

Is there any generalization of rank numbers to group numbers? For my code I would like to create a hierarchical decomposition of MPI::COMM_WORLD. Assume we use 16 processes. I use MPI::COMM_WORLD.Split to create 4 communicators, each having 4 ranks. Is there an MPI function that provides a unique id for each of the four resulting groups?
Well, you can still refer to each process by its original rank in MPI_COMM_WORLD. You also have complete control over what rank each process receives in its new communicator via the color and key arguments of MPI_Comm_split(). This is plenty of information to create a mapping between old ranks and new groups/ranks.
If you don't like #suszterpatt's answer (I do) you could always abuse a Cartesian communicator and pretend that the process at index (2,3) in the communicator is process 3 in group 2 of your hierarchical decomposition.
But don't read this and take away the impression that I recommend such abuse, it's just a thought.
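The color/key mapping from the first answer can be sketched in plain Python (no MPI); the choice color = rank // 4 is one assumed convention for 4 groups of 4:

```python
# Sketch: deriving a "group number" from the MPI_Comm_split arguments.
# The color itself serves as the group id; the key (here just the world
# rank) fixes the ordering of ranks inside each sub-communicator.
n_ranks = 16

def split(rank):
    color = rank // 4   # which of the 4 sub-communicators this rank joins
    key = rank          # ordering of ranks inside the sub-communicator
    return color, key

def new_rank(rank):
    """Rank this process would receive in its sub-communicator."""
    color, key = split(rank)
    peers = sorted(k for r in range(n_ranks)
                   for c, k in [split(r)] if c == color)
    return peers.index(key)
```

For example, world rank 6 lands in group 1 with new rank 2, so (group, new rank) is exactly the unique id the question asks for.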

Quantum Computing and Encryption Breaking

I read a while back that quantum computers can break most types of hashing and encryption in use today in a very short amount of time (I believe it was mere minutes). How is this possible? I've tried reading articles about it, but I get lost at "a quantum bit can be 1, 0, or something else". Can someone explain how this relates to cracking such algorithms in plain English without all the fancy maths?
Preamble: Quantum computers are strange beasts that we really haven't yet tamed to the point of usefulness. The theory that underpins them is abstract and mathematical, so any discussion of how they can be more efficient than classical computers will inevitably be long and involved. You'll need at least an undergraduate understanding of linear algebra and quantum mechanics to understand the details, but I'll try to convey my limited understanding!
The basic premise of quantum computation is quantum superposition. The idea is that a quantum system (such as a quantum bit, or qubit, the quantum analogue of a normal bit) can, as you say, exist not only in the 0 and 1 states (called the computational basis states of the system), but also in any combination of the two (so that each has an amplitude associated with it). When the system is observed by someone, the qubit's state collapses into one of its basis states (you may have heard of the Schrödinger's cat thought experiment, which is related to this).
Because of this, a register of n qubits has 2^n basis states of its own (these are the states that you could observe the register being in; imagine a classical n-bit integer). Since the register can exist in a superposition of all these states at once, it is possible to apply a computation to all 2^n register states rather than just one of them. This is called quantum parallelism.
Because of this property of quantum computers, it may seem like they're a silver bullet that can solve any problem exponentially faster than a classical computer. But it's not that simple: the problem is that once you observe the result of your computation, it collapses (as I mentioned above) into the result of just one of the computations – and you lose all of the others.
The field of quantum computation/algorithms is all about trying to work around this problem by manipulating quantum phenomena to extract information in fewer operations than would be possible on a classical computer. It turns out that it's very difficult to contrive a "quantum algorithm" that is faster than any possible classical counterpart.
The example you ask about is that of quantum cryptanalysis. It's thought that quantum computers might be able to "break" certain encryption algorithms: specifically, the RSA algorithm, which relies on the difficulty of finding the prime factors of very large integers. The algorithm which allows for this is called Shor's algorithm, which can factor integers with polynomial time complexity. By contrast, the best known classical algorithm for the problem has sub-exponential (but still super-polynomial) time complexity, and the problem is hence considered "intractable".
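Shor's algorithm splits into a quantum part, which finds the period r of f(x) = a^x mod N, and purely classical post-processing. A toy sketch of that classical part (with the period found by slow brute force, which is exactly the step a quantum computer accelerates; N = 15 is the standard toy example):

```python
from math import gcd

def factor(N, a):
    """Toy Shor post-processing: recover a factor of N from the period of a."""
    if gcd(a, N) != 1:
        return gcd(a, N)            # lucky guess: a already shares a factor
    r = 1
    while pow(a, r, N) != 1:        # find the period (classically slow)
        r += 1
    if r % 2 == 1 or pow(a, r // 2, N) == N - 1:
        return None                 # unlucky choice of a; retry with another
    # a^(r/2) - 1 and a^(r/2) + 1 each share a nontrivial factor with N
    return gcd(pow(a, r // 2) - 1, N)
```

For N = 15 and a = 7 the period is r = 4, and gcd(7^2 - 1, 15) = 3 recovers a prime factor; the other factor comes from gcd(7^2 + 1, 15) = 5.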
If you want a deeper understanding of this, get a few books on linear algebra and quantum mechanics and get comfortable. If you want some clarification, I'll see what I can do!
Aside: to better understand the idea of quantum superposition, think in terms of probabilities. Imagine you flip a coin and catch it on your hand, covered so that you can't see it. As a very tenuous analogy, the coin can be thought of as being in a superposition of the heads and tails "states": each one has a probability of 0.5 (and, naturally, since there are two states, these probabilities add up to 1). When you take your hand away and observe the coin directly, it collapses into either the heads state or the tails state, and so the probability of this state becomes 1, while the other becomes 0. One way to think about it, I suppose, is a set of scales that is balanced until observation, at which point it tips to one side as our knowledge of the system increases and one state becomes the "real" state.
Of course, we don't think of the coin as a quantum system: for all practical purposes, the coin has a definite state, even if we can't see it. For genuine quantum systems, however (such as an individual particle trapped in a box), we can't think about it in this way. Under the conventional interpretation of quantum mechanics, the particle fundamentally has no definite position, but exists in all possible positions at once. Only upon observation is its position constrained in space (though only to a limited degree; cf. uncertainty principle), and even this is purely random and determined only by probability.
By the way, quantum systems are not restricted to having just two observable states (those that do are called two-level systems). Some have a large but finite number, some have a countably infinite number (such as a "particle in a box" or a harmonic oscillator), and some even have an uncountably infinite number (such as a free particle's position, which isn't constrained to individual points in space).
It's highly theoretical at this point. Quantum Bits might offer the capability to break encryption, but clearly it's not at that point yet.
At the Quantum Level, the laws that govern behavior are different than in the macro level.
To answer your question, you first need to understand how encryption works.
At a basic level, RSA encryption relies on the product of two extremely large prime numbers. This super large result is divisible only by 1, itself, and those two primes.
One way to break the encryption is to brute-force the two prime numbers by doing prime factorization.
This attack is slow, and is thwarted by picking larger and larger primes. You hear of key sizes of 40, 56, 128, and now 256, 512 bits and beyond. Those sizes correspond to the size of the number.
The brute-force algorithm (in simplified terms) might look like
for (int64_t i = 3; i * i <= key; i += 2)
{
    if (key % i == 0)
    {
        /* i is a factor of key */
    }
}
So you want to brute-force prime numbers; well, that is going to take a while with a single computer. So you might try grouping a bunch of computers together to divide and conquer. That works, but is still slow for very large key sizes.
Quantum bits address this by being both 0 and 1 at the same time. So say you have 3 quantum bits (no small feat, mind you).
With 3 qubits, your register can hold the values 0-7 simultaneously (000, 001, 010, 011, etc.), which includes the prime numbers 3, 5 and 7 at the same time.
So using the simple algorithm above, instead of increasing i by 1 each time, you can divide once and check 0, 1, 2, 3, 4, 5, 6, 7 all at the same time.
Of course, quantum bits aren't at that point yet; there is still lots of work to be done in the field. But this should give you an idea of how, if we could program with qubits, we might go about cracking encryption.
The Wikipedia article does a very good job of explaining this.
In short, if you have N bits, your quantum computer can be in 2^N states at the same time. Similar conceptually to having 2^N CPU's processing with traditional bits (though not exactly the same).
A quantum computer can implement Shor's algorithm, which can quickly perform prime factorization. Encryption systems are built on the assumption that large numbers cannot be factored into their primes in a reasonable amount of time on a classical computer.
Almost all our public-key encryptions (ex. RSA) are based solely on math, relying on the difficulty of factorization or discrete-logarithms. Both of these will be efficiently broken using quantum computers (though even after a bachelors in CS and Math, and having taken several classes on quantum mechanics, I still don't understand the algorithm).
However, hashing algorithms (Ex. SHA2) and symmetric-key encryptions (ex. AES), which are based mostly on diffusion and confusion, are still secure.
In the most basic terms, a normal non-quantum computer works by operating on bits (states of on or off) using boolean logic. You do this very fast for lots and lots of bits, and you can solve any problem in the class of computable problems.
However, there are "speed limits", namely something called computational complexity. This, in layman's terms, means that for a given algorithm the time it takes to run (and the memory space required to run it) has a minimum bound. For example, an algorithm that is O(n^2) requires time proportional to n^2 for an input of size n.
However, this kind of goes out the window when we have qubits (quantum bits), which can hold "in between" values while you operate on them. Algorithms that would have very high computational complexity (like factoring huge numbers, the key to cracking many encryption algorithms) can be done with much, much lower computational complexity. This is the reason quantum computing would be able to crack encrypted streams orders of magnitude quicker than normal computers.
First of all, quantum computing is still barely out of the theoretical stage. Lots of research is going on and a few experimental quantum cells and circuits, but a "quantum computer" does not yet exist.
Second, read the wikipedia article: http://en.wikipedia.org/wiki/Quantum_computer
In particular, "In general a quantum computer with n qubits can be in an arbitrary superposition of up to 2^n different states simultaneously (this compares to a normal computer that can only be in one of these 2^n states at any one time). "
What makes cryptography secure is the use of encryption keys that are very long numbers that would take a very, very long time to factor into their constituent primes, keys long enough that brute-force attempts to try every possible key value would also take too long to complete.
Since quantum computing can (theoretically) represent a lot of states in a small number of qubit cells, and operate on all of those states simultaneously, it seems there is the potential to use quantum computing to perform brute-force try-all-possible-key-values in a very short amount of time.
If such a thing is possible, it could be the end of cryptography as we know it.
Quantum computers etc. are all lies. I don't believe these science fiction magazines.
In fact, the RSA system is based on two prime numbers and their multiplication.
p1 and p2 are huge primes, and p1*p2 = N is the modulus.
The RSA system works like this:
choose a prime number, maybe small; it's E, the public exponent
(p1-1)*(p2-1) = R
find a number D such that E*D = 1 mod R
we share (E, N) publicly as the public key
we securely save (D, N) as the private key
To crack this RSA system, an attacker needs to find the prime factors of N.
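The scheme above in toy numbers (real RSA uses primes hundreds of digits long; these particular values are purely illustrative, and the modular-inverse form of pow needs Python 3.8+):

```python
# Toy RSA following the recipe above.
p1, p2 = 61, 53
N = p1 * p2                  # 3233, the public modulus
R = (p1 - 1) * (p2 - 1)      # 3120
E = 17                       # public exponent, coprime to R
D = pow(E, -1, R)            # private exponent: E*D = 1 (mod R)

message = 65
cipher = pow(message, E, N)  # encrypt with the public key (E, N)
plain = pow(cipher, D, N)    # decrypt with the private key (D, N)
```

Decryption recovers the message exactly because E*D = 1 mod R; an attacker who could factor N into p1 and p2 could compute R and hence D the same way.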
The mass of the universe is around 10^53 kg.
The electron mass is 9.10938291 × 10^-31 kg.
If we divided the universe into electrons, we could create about 10^83 electrons.
Electrons move slower than light; their operating frequency might be around 10^26 per second.
If anybody built electron-sized parallel RSA prime-factor finders out of the entire mass of the universe,
the whole universe could test about 10^83 × 10^26 = 10^109 numbers per second.
RSA can use an unlimited number of bits, say 4096.
4096-bit RSA has on the order of 10^600 possible primes to brute force.
So your universe-mass quantum solver would need to run its tests for on the order of 10^484 years.
RSA vs universe-mass quantum computer: 1 - 0
Maybe a quantum computer could break 64- or 128-bit passwords, because a 128-bit password has only about 10^38 possible brute-force candidates.
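The back-of-the-envelope figures above can be checked with base-10 exponent arithmetic (all inputs are the rough assumptions stated above, so only orders of magnitude are meaningful):

```python
from math import log10

# Work with base-10 exponents, since the actual values overflow floats.
electrons = round(log10(1e53 / 9.1e-31))   # ~10^83 electron-sized machines
rate = 26                                  # 10^26 tests/second per machine
tests_per_second = electrons + rate        # 10^109 tests/second in total
candidates = 600                           # ~10^600 primes for 4096-bit RSA
seconds = candidates - tests_per_second    # 10^491 seconds of work
years = seconds - round(log10(3.15e7))     # 1 year ~ 3.15e7 seconds
```

The result is on the order of 10^484 years, which is the point of the comparison: even a universe-sized brute-force machine makes no dent in 4096-bit RSA.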
This circuit is a good start to understand how qubit parallelism works. The 2-qubit input is on the left side; the top qubit is x and the bottom qubit is y. The y qubit is 0 at the input, just like a normal bit. The x qubit, on the other hand, is in superposition at the input. y (+) f(x) stands here for addition modulo 2 (XOR), meaning 1+1=0 and 0+1=1+0=1. But the interesting part is that, since the x qubit is in superposition, f(x) is f(0) and f(1) at the same time, and we can evaluate the function f for all states simultaneously without using any (time-consuming) loops. With enough qubits we can branch this into endlessly more complicated circuits.
Even more bizarre, IMO, is Grover's algorithm. As input we get an unsorted array of n integers. What is the expected runtime of an algorithm that finds the minimum value of this array? Classically, we have to check every one of the n elements, giving an expected runtime of order n. Not so for quantum computers: on a quantum computer we can solve this in roughly sqrt(n) steps, meaning we don't even have to look at every element to find the solution (with high probability).
