i don't really understand FFT and sample rates - math

I'm really confused here. I am an AI programmer working on a game that is designed to detect beats in songs, among other things. I have no previous knowledge of audio and am just reading through whatever material I can find. While I got the FFT working, I simply don't understand how samples are transformed into different frequencies. Question 1: what does each frequency stand for? With the algorithm I have, I can transform, for example, 1024 samples into 512 outputs. So are they a description of the strength of each part of the spectrum at the current second? It doesn't really make sense, since what I remember is that there are 20,000 Hz in a 44.1 kHz audio recording. So how do 512 spectrum samples explain what is happening at that moment? Question 2: from what I read, a sample is a number that represents the sound wave at this moment. However, I also read that by squaring both the left channel and the right channel and adding them together you get the current power level. These two seem incoherent to my understanding, and I am really baffled, so please explain away.

DFT output
The output is a complex representation of a phasor (Re, Im, frequency) for each basis function (usually a sine wave). The first item is the DC offset, so skip it. All the others are multiples of the same fundamental frequency (sampling rate/N). The output is symmetric (if the input is real only), so use just the first half of the results. Often the amplitude spectrum is used:
Amplitude=sqrt(Re^2+Im^2)
which is the amplitude of the basis function. If the phase is needed, then:
phase=atan2(Im,Re)
Beware: DFT results depend strongly on the input signal's shape, frequency and phase shift relative to your basis functions. This causes the output to oscillate around the correct value and to produce wide peaks instead of sharp ones for single frequencies (spectral leakage), not to mention aliasing.
frequencies
If your sampling rate is 44100 Hz, then the maximum output frequency is half of it, which means the highest frequency that can be present in the data is 22050 Hz. The first half of the DFT output (ignoring the mirrored second half) does not contain this frequency, however:
for 4 samples the DFT output frequencies are { -,11025 } Hz
for 8 samples the frequencies are { -,5512.5,11025,16537.5 } Hz
The output frequency is linear in its index from the start, so if you have N=512 samples:
do a DFT on them
take the first N/2=256 results
the i-th result represents the frequency f=i*samplerate/N Hz
where i={ 1,...,(N/2)-1 } ... skipping i=0
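The index-to-frequency mapping and the amplitude/phase formulas above can be sketched in a few lines of Python. This is only an illustration: the naive `dft` helper, the `bin_frequency` name and the 2/N amplitude scaling (which assumes an unnormalized transform) are assumptions of this sketch, not part of any particular FFT library.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) DFT, for illustration only; a real program would call an FFT library."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]

def bin_frequency(i, sample_rate, n):
    """Frequency in Hz represented by output index i of an n-point DFT."""
    return i * sample_rate / n

sample_rate = 44100
n = 8
# Test tone of amplitude 1.0 placed exactly on bin 2 (11025 Hz).
tone = [math.sin(2 * math.pi * 11025 * t / sample_rate) for t in range(n)]
spectrum = dft(tone)

# Skip i=0 (DC offset) and the mirrored second half of the output.
bins = []
for i in range(1, n // 2):
    amplitude = 2 * abs(spectrum[i]) / n  # 2/N scaling assumes an unnormalized DFT
    phase = math.atan2(spectrum[i].imag, spectrum[i].real)
    bins.append((bin_frequency(i, sample_rate, n), amplitude, phase))

for frequency, amplitude, phase in bins:
    print(f"{frequency:8.1f} Hz  amplitude={amplitude:.3f}  phase={phase:+.3f} rad")
```

Because the test tone sits exactly on a bin, all of its energy lands in that one bin; a tone between two bins would smear across its neighbours.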
The image shows one of my utility apps tied together with:
2-channel sound generator (top left)
2-channel oscilloscope (top right)
2-channel spectral analyzer (bottom) ... switched to linear frequency scale to make obvious what I mean in above text
zoom the image to see the settings ... I made it as close to the real devices as I could.
Here is a DCT and DFT comparison:
Here is the dependency of the DFT output on the input signal frequency (aliasing by the sampling rate):
more channels
Summing the power of the channels is safer. If you just add the channels, you could miss some data. For example, let the left channel play a 1 kHz sine wave and the right channel its exact opposite: if you just sum them, the result is zero, but you can still hear the sound (if you are not exactly in the middle between the speakers). If you analyze each channel independently, you need to calculate a DFT for each channel, but if you use the power sum of the channels (or the abs sum), you can obtain the frequencies for all channels at once. Of course, you need to scale the amplitudes.
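The cancellation example is easy to reproduce. A minimal sketch, assuming a 44100 Hz sample rate and a 10 ms buffer (both chosen here only for illustration):

```python
import math

sample_rate = 44100
n = 441  # 10 ms of audio, exactly 10 periods of a 1 kHz tone

# Left channel plays a 1 kHz sine wave, right channel plays its exact opposite.
left = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
right = [-sample for sample in left]

# Plain sum of the channels: the two signals cancel completely.
mixed = [l + r for l, r in zip(left, right)]

# Power sum of the channels: squaring removes the sign, so nothing cancels.
power = [l * l + r * r for l, r in zip(left, right)]
total_power = sum(power)

print("largest |left + right| :", max(abs(m) for m in mixed))
print("sum of left^2 + right^2:", total_power)
```

The plain mix is silent even though both speakers are playing, while the power sum still reports the energy of both channels.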
[Notes]
The bigger N is, the nicer the result (finer frequency resolution, fewer artifacts, and the top bin gets closer to the maximum frequency). For detecting specific frequencies, FIR filter detectors are more precise and faster.
I strongly recommend reading DFT and all the sublinks there, and also this: plotting real time Data on (qwt) Oscilloscope

Related

Why are frequencies represented as complex numbers?

In a FFT, the resulting frequencies represent both magnitude and phase. Since each frequency element in the output array of an FFT essentially just describes the SIN wave at each frequency interval, shouldn't it just be magnitude that we need? What is the significance of the phase represented in the imaginary part of the complex number?
To clarify my question, to be able to put a meaning to the phase of a wave, I need a reference point or reference wave.
When an FFT reports the phase for each sin wave in the resulting frequency domain output, what is the reference wave with respect to which it is reporting the phase?
Because the phase of different components affects the total signal. The two functions in the plot are both summed from sine waves with periods of pi and 2pi, but the phase of the p=2pi sine waves are different. As you can see, the outputs are not the same.
Well in layman's words: magnitude tells you how much of that frequency is there, and phase tells you where it is.
FFTs (there is more than one convention) usually report phase with respect to the zero-th sample. Or if you use FFTShift, with respect to the sample at the center of an FFT window that indexes from 0 to N-1 (e.g. sample number N/2 = sin(0) for a phase of 0). The latter convention, centering phase using FFTShift, is often better, as there can be a big discontinuity at the edges of an FFT aperture, or nearly no data at the edges after using a tapered window function.
If you use FFTShift to center the phase reference, zero phase represents an even function, and a phase of pi or -pi represents an odd function in the window.
Human hearing, in general, can't discriminate the phase of a single sound source. BUT, phase is important when dealing with combined sounds, or multiple sine waves of the same frequency. Sinusoids that are in phase add or sum. Sinusoids of the opposite phase cancel. So if you have the FFT of, say, two loudspeaker responses without phase, you won't know whether they will sound great or horrible together.
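To make the "phase with respect to the zero-th sample" convention concrete, here is a small sketch using a naive single-bin DFT (the `dft_bin` helper is a hypothetical name for this example, and no FFTShift is applied). A cosine that peaks at sample zero comes out with phase 0, while a sine of the same frequency comes out a quarter turn behind it:

```python
import cmath
import math

def dft_bin(samples, k):
    """Value of DFT bin k, computed directly from the definition."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))

n, k = 64, 4
cosine = [math.cos(2 * math.pi * k * t / n) for t in range(n)]
sine = [math.sin(2 * math.pi * k * t / n) for t in range(n)]

# Phase is measured against a cosine that starts at sample zero.
phase_cos = cmath.phase(dft_bin(cosine, k))
phase_sin = cmath.phase(dft_bin(sine, k))

print(f"cosine phase: {phase_cos:+.4f} rad")
print(f"sine phase:   {phase_sin:+.4f} rad")
```

Both inputs have the same magnitude in bin k, so the phase is the only thing that tells them apart.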

Statistical best fit for gesture detection

I have a linear regression equation from school, which gives a value between 1 and -1 indicating whether or not a set of data points is close enough to a linear function,
and the equation given here
http://people.hofstra.edu/stefan_waner/realworld/calctopic1/regression.html
under best fit of a line. I would like to use these to do simple gesture detection based on a point in 3-space (x,y,z) - forward, back, left, right, up, down. First I would see if they fall on a line in 2 of the 3 dimensions, then I would see if that line's slope approached zero or infinity.
Is this fast enough for functional gesture recognition? If not, could someone propose an alternative algorithm?
If I've understood your question correctly then (1) the calculation you describe here would probably be plenty fast enough, (2) it may not actually do what you want, and (3) the stuff that'll be slow in an actual implementation would lie elsewhere.
So, I think you're proposing to do this. (1) Identify the positions of ... something ... (the user's hand, perhaps) in three-dimensional space, at several successive times. (2) For (say) each of {x,y} and {x,z}, look at those two coordinates of each point, compute the correlation coefficient (which is what your formula describes) and see whether it's close to +-1. (3) If both correlation coefficients are close to +-1 then the points lie approximately on a straight line; calculate the gradient of that line (using a formula similar to that of the correlation coefficient). (4) If the gradients are both very close to 0 or +- infinity, then your line is approximately parallel to one axis, which is the case you're trying to recognize.
1: Is it fast enough? You might perhaps be sampling at 50 frames per second or thereabouts, and your gestures might take a second to execute. So you'll have somewhere on the order of 50 positions. So, the total number of arithmetic operations you'll need is maybe a few hundred (including a modest number of square roots). In the worst case, you might be doing this in emulated floating-point on a slow ARM processor or something; in that case, each arithmetic operation might take a couple of hundred cycles, so the whole thing might be 100k cycles, which for a really slow processor running at 100MHz would be about a millisecond. You're not going to have any problem with the time taken to do this calculation.
2: Is it the right thing? It's not clear that it's the right calculation. For instance, suppose your user's hand moves back and forth rapidly several times along the x-axis; that will give you a positive result; is that what you want? Suppose the user attempts the gesture you want but moves at slightly the wrong angle; you may get a negative result. Suppose they move exactly along the x-axis for a bit and then along the y-axis for a bit; then the projections onto the {x,y}, {x,z} and {y,z} planes will all pass your test. These all seem like results you might not want.
3: Is it where the real cost will lie? This all assumes you've already got (x,y,z) coordinates. Getting those is probably going to be more expensive than processing them. For instance, if you have a camera-based system of some kind then there'll be some nontrivial image processing for every frame. Or perhaps you're integrating up data from accelerometers (which, by the way, is likely to give nasty inaccurate position results); the chances are that you're doing some filtering and other calculations to get position data. I bet that the cost of performing a calculation like this one will be substantially less than the cost of getting the coordinates in the first place.
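For reference, the per-plane calculation discussed above might look like the following sketch. The function name `correlation_and_slope` and the sample points are made up for illustration. Note one caveat the sketch exposes: when the motion is perfectly parallel to one axis, the other coordinate has zero variance and the correlation coefficient is undefined, so that case has to be handled separately.

```python
import math

def correlation_and_slope(us, vs):
    """Pearson correlation coefficient and least-squares slope of v against u.

    Returns None when either coordinate does not vary at all, because the
    correlation coefficient is undefined in that case.
    """
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_vv = sum((v - mean_v) ** 2 for v in vs)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    if s_uu == 0 or s_vv == 0:
        return None
    r = s_uv / math.sqrt(s_uu * s_vv)
    slope = s_uv / s_uu
    return r, slope

# Hypothetical positions sampled along the line y = 2x.
xs = [0, 1, 2, 3]
ys = [0, 2, 4, 6]
print(correlation_and_slope(xs, ys))

# A perfectly horizontal motion: y never changes, so r is undefined.
print(correlation_and_slope([0, 1, 2], [5, 5, 5]))
```

This costs a handful of multiplications per point, which is consistent with the estimate that the arithmetic is not the bottleneck.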

Converting Real and Imaginary FFT output to Frequency and Amplitude

I'm designing a real time Audio Analyser to be embedded on a FPGA chip. The finished system will read in a live audio stream and output frequency and amplitude pairs for the X most prevalent frequencies.
I've managed to implement the FFT so far, but its current output is just the real and imaginary parts for each window, and what I want to know is: how do I convert this into the frequency and amplitude pairs?
I've been doing some reading on the FFT, and I see how they can be turned into a magnitude and phase relationship but I need a format that someone without a knowledge of complex mathematics could read!
Thanks
Thanks for these quick responses!
The output from the FFT I'm getting at the moment is a continuous stream of real and imaginary pairs. I'm not sure whether to break these up into packets of the same size as my input packets (64 values), and treat them as an array, or deal with them individually.
The sample rate, I have no problem with. As I configured the FFT myself, I know that it's running off the global clock of 50MHz. As for the Array Index (if the output is an array of course...), I have no idea.
If we say that the output is a series of One-Dimensional arrays of 64 complex values:
1) How do I find the array index [i]?
2) Will each array return a single frequency part, or a number of them?
Thank you so much for all your help! I'd be lost without it.
Well, the bad news is, there's no way around needing to understand complex numbers. The good news is, just because they're called complex numbers doesn't mean they're, y'know, complicated. So first, check out the wikipedia page, and for an audio application I'd say, read down to about section 3.2, maybe skipping the section on square roots: http://en.wikipedia.org/wiki/Complex_number
What that's telling you is that if you have a complex number, a + bi, you can picture it as living in the x,y plane at location (a,b). To get the magnitude and phase, all you have to do is find two quantities:
The distance from the origin of the plane, which is the magnitude, and
The angle from the x-axis, which is the phase.
The magnitude is simple enough: sqrt(a^2 + b^2).
The phase is equally simple: atan2(b,a).
The FFT result will give you an array of complex values. Twice the magnitude (the square root of the sum of the squared real and imaginary components) of each array element is an amplitude, up to a scale factor that depends on how the FFT implementation normalizes its output. Or take the log magnitude if you want a dB scale. The array index will give you the center of the frequency bin with that amplitude. You need to know the sample rate and the FFT length to get the frequency of each array element or bin:
f[i] = i * sampleRate / fftLength
for the first half of the array (the other half is just duplicate information in the form of complex conjugates for real audio input).
The frequency of each FFT result bin may be different from any actual spectral frequencies present in the audio signal, due to windowing or so-called spectral leakage. Look up frequency estimation methods for the details.
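Putting the pieces together, turning one window of complex FFT output into the strongest frequency and amplitude pairs could be sketched as below. The function name `strongest_bins`, the 48000 Hz sample rate and the hand-built 64-point spectrum are assumptions for illustration only, and the 2/N amplitude scaling assumes an FFT that does not normalize its output.

```python
def strongest_bins(spectrum, sample_rate, count):
    """Return the `count` largest (frequency_hz, amplitude) pairs of one FFT window.

    `spectrum` is a list of complex values, one per FFT output bin.
    """
    n = len(spectrum)
    pairs = []
    # Skip bin 0 (DC) and the mirrored second half of the array.
    for i in range(1, n // 2):
        frequency = i * sample_rate / n
        amplitude = 2 * abs(spectrum[i]) / n  # assumes an unnormalized FFT
        pairs.append((frequency, amplitude))
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:count]

# Hypothetical 64-point FFT output containing two tones.
spectrum = [0j] * 64
spectrum[4], spectrum[60] = complex(0, -32), complex(0, 32)   # tone in bin 4
spectrum[10], spectrum[54] = complex(16, 0), complex(16, 0)   # weaker tone in bin 10

for frequency, amplitude in strongest_bins(spectrum, 48000, 2):
    print(f"{frequency:.1f} Hz  amplitude {amplitude:.2f}")
```

Each block of 64 complex outputs is one array, the position of a value within its block is the index i, and every block describes all the frequency bins at once rather than a single frequency.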

fourier transform to transpose key of a wav file

I want to write an app to transpose the key a wav file plays in (for fun, I know there are apps that already do this)... my main understanding of how this might be accomplished is to
1) chop the audio file into very small blocks (say 1/10 a second)
2) run an FFT on each block
3) phase shift the frequency space up or down depending on what key I want
4) use an inverse FFT to return each block to the time domain
5) glue all the blocks together
But now I'm wondering if the transformed blocks would no longer be continuous when I try to glue them back together. Are there ideas how I should do this to guarantee continuity, or am I just worrying about nothing?
Overlap the time samples for each block by half so that each block after the first consists of the last N/2 samples from the previous block and N/2 new samples. Be sure to apply some window to the samples before the transform.
After shifting the frequency, perform an inverse FFT and use the middle N/2 samples from each block. You'll need to adjust the final gain after the IFFT.
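As an illustration of why half-overlapped, windowed blocks glue back together without seams, here is a sketch of windowed overlap-add. Note that this is a closely related variant, not the "keep the middle N/2 samples" scheme described above: every windowed block is added back in full. The `process` hook is a placeholder where the FFT, the frequency shift and the inverse FFT would go; with the default identity processing the output should match the input wherever two blocks overlap.

```python
import math

def hann(n):
    """Periodic Hann window; copies shifted by n/2 add up to exactly 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def overlap_add(signal, block_size, process=lambda block: block):
    """Split into half-overlapping windowed blocks, process each, and sum them back."""
    hop = block_size // 2
    window = hann(block_size)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - block_size + 1, hop):
        block = [signal[start + i] * window[i] for i in range(block_size)]
        block = process(block)  # FFT -> modify -> inverse FFT would go here
        for i, value in enumerate(block):
            out[start + i] += value
    return out

signal = [math.sin(0.1 * t) for t in range(64)]
rebuilt = overlap_add(signal, block_size=16)

# The first and last half-block are covered by only one window, so compare the interior.
worst = max(abs(rebuilt[i] - signal[i]) for i in range(8, 56))
print("largest interior error:", worst)
```

Because each block fades in and out smoothly, small differences between neighbouring processed blocks are cross-faded instead of producing a click at the boundary.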
Of course, mixing the time samples with a sine wave and then low pass filtering will provide the same shift in the time domain as well. The frequency of the mixer would be the desired frequency difference.
For speech you might want to look at PSOLA - this is a popular algorithm for pitch-shifting and/or time stretching/compression which is a little more sophisticated than the basic overlap-add method, but not much more complex.
If you need to process non-speech samples, e.g. music, then there are several possibilities, however the overlap-add FFT/modify/IFFT approach mentioned in other answers is probably the best bet.
Found this great article on the subject, for anyone trying it in the future!
You may have to find a zero-crossing between the blocks to glue the individual wavs back together. Otherwise you may find that you are getting clicks or pops between the blocks.

Finding area of straight line with graph (Math question but needed for flot)

Okay, so this is a straight math question and I read up on meta that those need to be written to sound like programming questions. I'll do my best...
So I have graph made in flot that shows the network usage (in bytes/sec) for the user. The data is 4 minutes apart when there is activity, and otherwise set at the start of the usage range (let's say day 1) and the end of the range (day 7). The data is coming from a CGI script I have no control over, so I'm fairly limited in what I can provide the user.
I never took trig or calculus, so I'm pretty much in over my head. What I want is for the user to have the option to click any point on the graph and see their bandwidth usage for that moment. Since the lines between real data points are drawn straight, this can be done by getting the points before and after where the user has clicked and finding the y-interval.
It took me weeks to finally get a helpful math person to explain this to me. Everyone else has insisted on trying to teach me Riemann sum techniques and all sorts of other heavy stuff that not only is confusing to me, doesn't seem necessary for the problem.
But I also want the user to be able to highlight the graph from two arbitrary points on the y-axis (time) to get the amount of network usage total during that range. I know this would be inaccurate, but I need it to be the right inaccurate using a solid equation.
I thought this was the area under the line, but experiments with much simpler graphs make this seem just far too high. I figured out I could take the distance from y2 - y1 and multiply it by x2 - x1 and then divide by two to get the area of the graph below the line like a triangle, but again, the numbers seemed too high. (maybe they are just big numbers and I don't get this math stuff at all).
So what I need, if anyone would be really awesome enough to provide it before this question is closed down for being too pure-math, is either the name of the concept I should be researching or the equation itself. Or the bad news that I do need advanced math to get an accurate result.
I am not bad at math, just as a last note, I just am not familiar with math beyond 10th grade and so I need some place to start. All the math sites seem to keep it too simple or way over my paygrade.
If I understood correctly what you're asking (and that is somewhat doubtful), you should find what you seek in these links:
Linear interpolation
(calculating the value of the point in between)
Trapezoidal rule
(calculating the area below the "curve")
*****Edit, so we can get this over :) without much ado:*****
So I have graph made in flot that shows the network usage (in bytes/sec) for the user. The data is 4 minutes apart when there is activity, and otherwise set at the start of the usage range (let's say day 1) and the end of the range (day 7). The data is coming from a CGI script I have no control over, so I'm fairly limited in what I can provide the user.
What is a "flot" ?
Okay, so you have speed on the y axis [in bytes/sec] and time on the x axis [in sec], right?
That means that if you're flotting (I'm bored, yes :) speed over time in linear segments, interpolating at some particular point in time will give you the speed at that particular point in time.
If you wish to calculate how much bandwidth you've spent, you need to determine the area beneath that curve. The area from point "a" to point "b" will determine the bandwidth spent, in [bytes], in that time period.
It took me weeks to finally get a helpful math person to explain this to me. Everyone else has insisted on trying to teach me Riemann sum techniques and all sorts of other heavy stuff that not only is confusing to me, doesn't seem necessary for the problem.
In the immortal words of Snoopy: "Good grief !"
But I also want the user to be able to highlight the graph from two arbitrary points on the y-axis (time) to get the amount of network usage total during that range. I know this would be inaccurate, but I need it to be the right inaccurate using a solid equation.
It would not be inaccurate.
It would be actually perfectly accurate (well, apart from roundoff error in bytes :), since you're using linear interpolation on linear segments.
I thought this was the area under the line, but experiments with much simpler graphs make this seem just far too high. I figured out I could take the distance from y2 - y1 and multiply it by x2 - x1 and then divide by two to get the area of the graph below the line like a triangle, but again, the numbers seemed too high. (maybe they are just big numbers and I don't get this math stuff at all).
"like a triangle" --> should be "like a trapezoid"
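If you do deltax*(y1+y2)/2 you will get the area, yes (this works only for linear segments). Using y2-y1 instead gives only the small triangle on top and leaves out the rectangle of height y1 beneath it. This is the basic principle of the trapezoidal rule.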
If you're uncertain about what you're calculating use dimensional analysis: speed is in bytes/sec, time is in sec, bandwidth is in bytes. Multiplying speed*time=bandwidth, and so on.
What I want is for the user to have the option to click any point on the graph and see their bandwidth usage for that moment. Since the lines between real data points are drawn straight, this can be done by getting the points before and after where the user has clicked and finding the y-interval.
Yes, that's a good way to find that instantaneous value. When you report that value back, it's in the same units as the y-axis, so that means bytes/sec, right?
I don't know how rapidly the rate changes between points, but it's even simpler if you simply pick the closest point and report its value. You simplify your problem without sacrificing too much accuracy.
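Reading the value between two data points is plain linear interpolation. A minimal sketch, with a made-up function name `rate_at` and made-up data points:

```python
def rate_at(t, t0, y0, t1, y1):
    """Linearly interpolate the rate at time t between points (t0, y0) and (t1, y1).

    Times are in seconds and rates in bytes/sec; t0 and t1 must differ.
    """
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)

# Two hypothetical samples 4 minutes (240 s) apart: 100 B/s, then 500 B/s.
clicked = rate_at(60, 0, 100, 240, 500)
print(clicked, "bytes/sec")
```

The result is in the same units as the y-axis, here bytes per second.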
I thought this was the area under the line, but experiments with much simpler graphs make this seem just far too high. I figured out I could take the distance from y2 - y1 and multiply it by x2 - x1 and then divide by two to get the area of the graph below the line like a triangle, but again, the numbers seemed too high. (maybe they are just big numbers and I don't get this math stuff at all).
To calculate the total bytes over a given time interval, you should find the index closest to the starting and ending point and multiply the value of y by the spacing of your x-points and add them all together. That will give you the total # of bytes consumed during that time interval, but there's one more wrinkle you might have forgotten.
You said that the points come in "4 minutes apart", and your y-axis is in bytes/second. Remember that units matter. Your area is the sum of bytes/second times a spacing in minutes. To make the units come out right you have to multiply by 60 seconds/minute to get the final value of bytes that you want.
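The "sum the values and multiply by the spacing" approach, including the minutes-to-seconds conversion, can be sketched like this (the function name and the sample rates are invented for the example):

```python
def total_bytes_rectangles(rates_bytes_per_sec, spacing_minutes=4):
    """Approximate total bytes as (sum of rates) x (spacing between points).

    The rates are in bytes/sec, so the spacing must be converted to seconds first.
    """
    spacing_seconds = spacing_minutes * 60
    return sum(rates_bytes_per_sec) * spacing_seconds

# Three hypothetical readings taken 4 minutes apart.
print(total_bytes_rectangles([100, 200, 300]), "bytes")
```

Forgetting the factor of 60 would make the total sixty times too small, and comparing bytes against kilobytes or megabytes would throw it off by a factor of 1024 or more.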
If that "too high" value is still off, consider units again. It's 1024 bytes per kbyte, and 1024*1024 bytes per MB. Check the units of the values you're checking the calculation against.
UPDATE:
No wonder you're having problems. Your original question CLEARLY stated bytes/sec. Even this question is imprecise and confusing. How did you arrive at "amount of data" at a given time stamp? Are those the total bits transferred since the last time stamp? If yes, simply add the values between the start and end of the interval you want and convert to the units convenient for you.
The network usage total is not in bytes (kilo-, mega-, whatever) per second. It would be in just straight bytes (or kilo-, or whatever).
For example, 2 megabytes per second over an interval of 10 seconds would be 20 megabytes total. It would not be 20 megabytes per second.
Or do you perhaps want average bytes per second over an interval?
This would be a lot easier for you if you would accept that there is well-established terminology for the concepts that you are having trouble expressing concisely or accurately, and that these mathematical terms have been around far longer than you. Since you've clearly gone through most of the trouble of understanding the concepts, you might as well break down and start calling them by their proper names.
That said:
There are 2 obvious ways to graph bandwidth, and two ways you might be getting the bandwidth data from the server. First, there's the cumulative usage function, which for any time is simply the total amount of data transferred since the start of the measurement. If you plot this function, you get a graph that never decreases (since you can't un-download something). The units of the values of this function will be bytes or kB or something like that.
What users are typically interested in is the instantaneous usage function, which is an indicator of how much bandwidth you are using right now. In mathematical terms, this is the derivative of the cumulative function. This derivative can take on any value from 0 (you aren't downloading) to the rated speed of your network link (indicating that you're pushing as much data as possible through your connection). The units of this function are bytes per second, or something related like Mbps (megabits per second).
You can approximate the instantaneous bandwidth with the average data usage over the past few seconds. This is computed as
(number of bytes transferred)
-----------------------------------------------------------------
(number of seconds that elapsed while transferring those bytes)
Generally speaking, the smaller the time interval, the more accurate the approximation. For simplicity's sake, you usually want to compute this as "number of bytes transferred since last report" divided by "number of seconds since last report".
As an example, if the server is giving you a report every 4 minutes of "total number of bytes transferred today", then it is giving you the cumulative function and you need to approximate the derivative. The instantaneous bandwidth usage rate you can report to users is:
(total transferred as of now) - (total as of 4 minutes ago) bytes
-----------------------------------------------------------
4*60 seconds
If the server is giving you reports of the form "number of bytes transferred since last report", then you can directly report this to users and plot that data relative to time. On the other hand, if the user (or you) is concerned about a quota on total bytes transferred per day, then you will need to transform the (approximately) instantaneous data you have into the cumulative data. This process, known as computing the integral, is the opposite of computing the derivative, and is in some ways conceptually simpler. If you've kept track of each of the reports from the server and the timestamp, then for each time, the value you plot is the total of all the reports that came in before that time. If you're doing this in realtime, then every time you get a new report, the graph jumps up by the amount in that report.
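Both directions, from cumulative totals to approximate rates and from per-report amounts back to a running total, fit in a few lines. A sketch with invented function names and numbers, assuming one report every 4 minutes:

```python
from itertools import accumulate

def rates_from_cumulative(totals, interval_seconds=4 * 60):
    """Approximate the derivative: bytes gained between reports / seconds between reports."""
    return [(later - earlier) / interval_seconds
            for earlier, later in zip(totals, totals[1:])]

def cumulative_from_reports(reports):
    """Approximate the integral: running total of 'bytes since last report' values."""
    return list(accumulate(reports))

# Hypothetical cumulative byte counts reported every 4 minutes.
totals = [0, 24000, 72000]
print(rates_from_cumulative(totals))

# Hypothetical 'bytes since last report' values for the same period.
print(cumulative_from_reports([24000, 48000]))
```

The first function yields bytes per second, the second yields plain bytes, which matches the units of the two kinds of graph described above.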
I am not bad at math, ... I just am not familiar with math beyond 10th grade
This is like saying "I'm not bad at programming, I have no trouble with ifs and loops but I never got around to writing more than one function."
I would suggest you enrol in a maths class of some kind. An understanding of matrices and the basics of calculus gives you an appreciation of many things, and can be useful in all sorts of areas. You'll be able to understand more of Wikipedia articles and SO answers - and questions!
If you can't afford that, try to find some lecture videos or something.
Everyone else has insisted on trying to teach me Riemann sum techniques
I can't see why. You don't need them for this - though if you had learned them, I expect you would find it easier to come up with a solution. You see, Riemann sums attempt to give you a "familiar" notion of area. The sort of area you (hopefully) learned years ago.
Getting the area below your usage graph between two points will tell you (approximately) how much was used over that period.
How do you find the area of a floor plan? You break it up into rectangles and triangles, find the area of each, and add them together. You can do the same thing with your graph, basically. Someone has worked out a simple way of doing this called the trapezoidal rule. It's just a matter of choosing how to divide your graph into strips, and in your case this is easy: just use the data points themselves as dividers. (You'll also need to work out the value of the graph at the left and right ends of the region selected by the user, using linear interpolation.)
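The whole recipe (use the data points as dividers, interpolate the two ends of the selection, then add up the trapezoids) might be sketched as follows. The function names and the sample data are invented for the example, and the sketch assumes the points are sorted by time, have distinct timestamps, and that the selection lies inside the data range.

```python
def interpolate(points, t):
    """Linearly interpolated rate at time t; points is a sorted list of (seconds, bytes/sec)."""
    for (t0, y0), (t1, y1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return y0 + (y1 - y0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the range covered by the data")

def bytes_used(points, start, end):
    """Trapezoidal rule between start and end, using the data points as strip dividers."""
    inner = [(t, y) for t, y in points if start < t < end]
    strip = ([(start, interpolate(points, start))]
             + inner
             + [(end, interpolate(points, end))])
    # Each strip is a trapezoid: width times the average of its two heights.
    return sum((t1 - t0) * (y0 + y1) / 2
               for (t0, y0), (t1, y1) in zip(strip, strip[1:]))

# Hypothetical readings: rate climbs from 0 to 240 B/s over 4 minutes, then stays flat.
points = [(0, 0), (240, 240), (480, 240)]
print(bytes_used(points, 120, 360), "bytes")
```

Since the graph really is made of straight segments, this sum is exact for the graph as drawn; any inaccuracy comes from the data being sampled only every few minutes, not from the formula.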
If there's anything I've said that isn't clear to you (as there may well be), please leave a comment.

Resources