I am wondering why the R3 protocol shows great performance, when using multiple different buffers which exhaust the registration cache. Does it not need to pin and unpin the buffers provided for sending or how does it hide this overhead? Is it always a good choice to stick to the R3 protocol?
On the bottom you see a diagram showing my observation. I used 2 nodes sending and receiving in parallel. The x-axis denotes the numbers of buffers n (each 1MB) used for sending. The main loop looks like this:
\\ Take time
for(i to 20){
for(a to n) IRecv(rec_buffer[a])
for(a to n) ISend(send_buffer[a])
waitForAllRecv()
waitForAllSend()
}
\\ Plot time
See plot at: http://i47.tinypic.com/2vkn6ty.jpg
Related
From what I have learned in my supercomputing class I know that MPI is a communicating (and data passing) interface.
I'm confused on when you run a function in a C++ program and want each processor to perform a specific task.
For example, a prime number search (very popular for supercomputers). Say I have a range of values (531-564, some arbitrary range) and say I have 50 processes I could run a series of evaluations on for each number. If root (process 0) wants to examine 531 and knowing prime numbers I can use 8 processes (1-8) to evaluate the prime status. If the number is divisible by any number 2-9 with a remainder of 0, then it is not prime.
Is it possible that for MPI which passes data to each process to have these processes perform these actions?
The hardest part for me is understanding that if I perform an action in the original C++ program the processes taking place could be allocated on several different processes, then in MPI how can I structure this? Or is my understanding completely wrong? If so how am I supposed to truly go about this path of thinking in a correct manner?
The big idea is passing data to a process versus having a function sent to a process. I'm fairly certain I'm wrong but I'm trying to back track to fix my thinking.
Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather, where the "master" process (usually the one with rank 0) will split an array of work equally amongst the peers (including the master process itself) by having them all call scatter, the peers will do the work, then all peers will call gather to send the results back to master.
In your prime algorithm example, build an array of integers, "scatter" it to all the peers, each peer will run through its array saving 1 if it is prime, 0 if it is not then "gather" the results to master. [In this particular example, since the input data is completely predictable based on process rank, the scatter step is unnecessary but we will do it anyway.]
As pseudo-code:
main():
int x[n], n = 100
MPI_init()
// prepare data on master
if rank == 0:
for i in 1 ... n, x[i] = i
// send data from x on root to local on each process in world
MPI_scatter(x, n, int, local, n/k, int, root, world)
for i in 1 ... n/k
result[i] = 1 // assume prime
if 2 divides local[i], result[i] = 0
if 3 divides local[i], result[i] = 0
if 5 divides local[i], result[i] = 0
if 7 divides local[i], result[i] = 0
// gather reults from local on each process in world to x on root
MPI_gather(result, n/k, int, x, n, int, root, world)
// print results
if rank == 0:
for i in 1 ... n, print i if x[i] == 1
MPI_finalize()
There are lots of details to fill in such as proper declarations, and dealing with the fact that some ranks will have fewer elements than others, using
proper C syntax, etc., but getting them right doesn't help explain the overall picture.
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write since the different processes may be in different states. In particular, it is important that if process a is calling MPI_send to process b, then process b had better be calling MPI_recv from a.
Let's say we have a packet of length L bits. It is transmitted from system A through three links to system B. The three links are connected by two packet switches. di, si and Ri are the length, propagation speed and transmission rate for each link, i, in the example network. Each packet switch delays each packed by dproc (processing time).
Lets also say that there are no queuing delays; so how would i go about writing a formula for computing the end-to-end delay for a packet of length L on this theoretical network?
This is what i have so far:
End-End Delay = L/R_1 + L/R_2 + L/R_3 + d_1/s_1 + d_2/s_2 + d_3/s_3 +2(d_proc)
Is this correct, if not, what is the correct formula and why so?
Yes, your formula is correct, assuming that the processing time of each switch is the same. Also, is calculating actual delay be sure to use same dimensions for units - bits and bits/s for size and transfer rate and meters and meters/s for propagation. Take note that if the switches are connected by the fiber-optic links you will have to divide speed of light by the diffraction rating of the fiber in calculations.
Im really confused over here. I am a ai programmer working on a game that is designed to detect beats in songs and some more. I have no previous knowledge about audio and just reading through whatever material i can find. While i got fft working and stuff I simply don't understand the way samples are transferred to different frequencies. Question 1, what does each frequency stands for. For the algorithm i got. I can transfer for example 1024 samples into 512 outcomes. So are they a description of the strength of each spectrum at the current second? it doesn't really make sense since what i remember is that there are 20,000hz in a 44.1khz audio recording. So how does 512 spectrum samples explain what is happening in that moment? Question 2, from what i read, its a number that represent the sound wave at this moment. However i read that by squaring both left channel and right channel, and add them together and you will get the current power level. Both these seems incoherent to my understanding, and i am really buff led so please explain away.
DFT output
the output is complex representation of phasor (Re,Im,Frequency) of basis function (usually sin wave). First item is DC offset so skip it. All the others are multiples of the same fundamental frequency (sampling rate/N). The output is symmetric (if the input is real only) so use just first half of results. Often power spectrum is used
Amplitude=sqrt(Re^2+Im^2)
which is the amplitude of basis function. If phase is needed then
phase=atan2(Im,Re)
beware DFT results are strongly dependent on the input signal shape,frequency and phase shift to your basis functions. That causes the output to vibrate/oscillate around the correct value and produce wide peaks instead of sharp ones for singular frequencies not to mention aliasing.
frequencies
if you got 44100Hz then the max output frequency is half of it that means the biggest frequency present in data is 22050Hz. The DFFT however does not contain this frequency so if you ignore the mirrored second half of results then:
for 4 samples DFT outputs frequencies are { -,11025 } Hz
for 8 samples frequencies are: { -,5512.5,11025,16537.5 } Hz
The output frequency is linear to its address from start so if you got N=512 samples
do DFFT on it
obtain first N/2=256 results
i-th sample represents frequency f=i*samplerate/N Hz
where i={ 1,...,(N/2)-1} ... skipping i=0
the image shows one of mine utility apps tighted together with
2-channel sound generator (top left)
2-channel oscilloscope (top right)
2-channel spectral analyzer (bottom) ... switched to linear frequency scale to make obvious what I mean in above text
zoom the image to see the settings ... I made it as close to the real devices as I could.
Here DCT and DFT comparison:
Here the DFT output dependency on input signal frequency aliasing by sampling rate
more channels
Summing power of channels is more safe. If you just add the channels then you could miss some data. For example let left channel is playing 1 Khz sin wave and the right exact opposite so if you just sum them then the result is zero but you can hear the sound .... (if you are not exactly in the middle between speakers). If you analyze each channel independently then you need to calculate DFFT for each channel but if you use power sum of channels (or abs sum) then you can obtain the frequencies for all channels at once , of coarse you need to scale the amplitudes ...
[Notes]
Bigger the N nicer the result (less aliasing artifacts and closer to the max frequency). For specific frequencies detection are FIR filter detectors more precise and faster.
Strongly recommend to read DFT and all sublinks there and also this plotting real time Data on (qwt) Oscillocope
i'm making network application which doesn't send good data every time (most of time they are broken) so i tought to make control sum. At the end of data i will add control sum to check if they are valid. So i'm not sure is that a good idea to multiply every data (they are from 1 to 100) by 100, 100^2, 100^3..., and sum them.
Do you have any suggestion what to do, without making really big number(there are many data in the every packet).
Example:
Data: 1,4,2,77,12,32,5,52,23
My solution:1,4,2,77,12,32,5,52,23, 100+40000+2000000+ 77*10^4 ...
When client receive the packet he will check if last data is equal to sum of other datas.
Is there any better solution?
Multiplying the data results in a very large number to transmit, and not a lot of confidence that the numbers are correct. And addition runs into potential overflow issues. That is why it is customary to use an xor.
Or you can read up on http://en.wikipedia.org/wiki/Error-correcting_code to get even fancier solutions that can detect, and sometimes correct, small numbers of errors.
Best explanation here:
http://www.textfiles.com/programming/crc.txt
CRC functions will be available in you language's networking library.
Because 128 is 10000000 in binary, there is only 1 bit for subnetting, and there are 7 bits for hosts. We're going to subneting the Class C network address 192.168.10.0.
192.168.10.0 = Network address
255.255.255.128= Subnet mask
we have a particle detector hard-wired to use 16-bit and 8-bit buffers. Every now and then, there are certain [predicted] peaks of particle fluxes passing through it; that's okay. What is not okay is that these fluxes usually reach magnitudes above the capacity of the buffers to store them; thus, overflows occur. On a chart, they look like the flux suddenly drops and begins growing again. Can you propose a [mostly] accurate method of detecting points of data suffering from an overflow?
P.S. The detector is physically inaccessible, so fixing it the 'right way' by replacing the buffers doesn't seem to be an option.
Update: Some clarifications as requested. We use python at the data processing facility; the technology used in the detector itself is pretty obscure (treat it as if it was developed by a completely unrelated third party), but it is definitely unsophisticated, i.e. not running a 'real' OS, just some low-level stuff to record the detector readings and to respond to remote commands like power cycle. Memory corruption and other problems are not an issue right now. The overflows occur simply because the designer of the detector used 16-bit buffers for counting the particle flux, and sometimes the flux exceeds 65535 particles per second.
Update 2: As several readers have pointed out, the intended solution would have something to do with analyzing the flux profile to detect sharp declines (e.g. by an order of magnitude) in an attempt to separate them from normal fluctuations. Another problem arises: can restorations (points where the original flux drops below the overflowing level) be detected by simply running the correction program against the reverted (by the x axis) flux profile?
int32[] unwrap(int16[] x)
{
// this is pseudocode
int32[] y = new int32[x.length];
y[0] = x[0];
for (i = 1:x.length-1)
{
y[i] = y[i-1] + sign_extend(x[i]-x[i-1]);
// works fine as long as the "real" value of x[i] and x[i-1]
// differ by less than 1/2 of the span of allowable values
// of x's storage type (=32768 in the case of int16)
// Otherwise there is ambiguity.
}
return y;
}
int32 sign_extend(int16 x)
{
return (int32)x; // works properly in Java and in most C compilers
}
// exercise for the reader to write similar code to unwrap 8-bit arrays
// to a 16-bit or 32-bit array
Of course, ideally you'd fix the detector software to max out at 65535 to prevent wraparound of the sort that is causing your grief. I understand that this isn't always possible, or at least isn't always possible to do quickly.
When the particle flux exceeds 65535, does it do so quickly, or does the flux gradually increase and then gradually decrease? This makes a difference in what algorithm you might use to detect this. For example, if the flux goes up slowly enough:
true flux measurement
5000 5000
10000 10000
30000 30000
50000 50000
70000 4465
90000 24465
60000 60000
30000 30000
10000 10000
then you'll tend to have a large negative drop at times when you have overflowed. A much larger negative drop than you'll have at any other time. This can serve as a signal that you've overflowed. To find the end of the overflow time period, you could look for a large jump to a value not too far from 65535.
All of this depends on the maximum true flux that is possible and on how rapidly the flux rises and falls. For example, is it possible to get more than 128k counts in one measurement period? Is it possible for one measurement to be 5000 and the next measurement to be 50000? If the data is not well-behaved enough, you may be able to make only statistical judgment about when you have overflowed.
Your question needs to provide more information about your implementation - what language/framework are you using?
Data overflows in software (which is what I think you're talking about) are bad practice and should be avoided. While you are seeing (strange data output) is only one side effect that is possible when experiencing data overflows, but it is merely the tip of the iceberg of the sorts of issues you can see.
You could quite easily experience more serious issues like memory corruption, which can cause programs to crash loudly, or worse, obscurely.
Is there any validation you can do to prevent the overflows from occurring in the first place?
I really don't think you can fix it without fixing the underlying buffers. How are you supposed to tell the difference between the sequences of values (0, 1, 2, 1, 0) and (0, 1, 65538, 1, 0)? You can't.
How about using an HMM where the hidden state is whether you are in an overflow and the emissions are observed particle flux?
The tricky part would be coming up with the probability models for the transitions (which will basically encode the time-scale of peaks) and for the emissions (which you can build if you know how the flux behaves and how overflow affects measurement). These are domain-specific questions, so there probably aren't ready-made solutions out there.
But one you have the model, everything else---fitting your data, quantifying uncertainty, simulation, etc.---is routine.
You can only do this if the actual jumps between successive values are much smaller than 65536. Otherwise, an overflow-induced valley artifact is indistinguishable from a real valley, you can only guess. You can try to match overflows to corresponding restorations, by simultaneously analysing a signal from the right and the left (assuming that there is a recognizable base line).
Other than that, all you can do is to adjust your experiment by repeating it with different original particle flows, so that real valleys will not move, but artifact ones move to the point of overflow.