I have a 1 kHz triangle wave generator whose output I am measuring with the analog input of a PIC microcontroller. The triangle wave generator and the ADC capture run from separate frequency sources. The ADC captures at 100 ksps with 12 [edit: 10] usable bits of precision.
I want to estimate the entropy contained in the analog samples for the purpose of generating true random numbers. The two sources of entropy I have identified are thermal (Johnson) noise and the offset between the two frequency sources.
From the captured waveform I can continuously distinguish about two frequencies per second, and I will capture on average one thermal-noise threshold upset event per second. Distinguishing one of two frequencies yields about one bit per second, and one upset event yields about one more, so my estimate is about two bits of entropy per second.
Can anyone think of a way to justify a larger entropy estimate?
Based on answers to similar questions already posted on S.O., I'll add the following clarifications:
I am not particularly interested in other ideas for entropy sources, as I would still have to answer this same question for those alternate sources.
Analysis of the data itself for autocorrelation or other measures of randomness is not the correct answer, as such measures will be wildly optimistic.
I made some progress that may help others.
Primary resource
http://en.wikipedia.org/wiki/Johnson%E2%80%93Nyquist_noise
The pin capacitance will limit the amount of measurable thermal noise to 20uV within the bandwidth of the ADC. This should be more or less the same for a pretty wide variety of controllers. Use a ~10K resistor between the signal and the pin. Smaller values will reduce the noise, but increase the possible sample rate.
The signal does not need to be random. It just needs to be evenly distributed within the range of at least a few discrete input steps. Note that outputting to a GPIO on the same clock domain as the input might not meet this requirement.
With a 10b ADC with a dynamic range of 3.3V, each discrete step is 3mV. Entropy per sample is around 20uV / 3mV = 0.006 bits per sample.
Also note that this doesn't require an analog input. You could do it with a digital input, but the bin size would be much larger (1V?), and the answer would be more like 0.000018 bits per sample. So taking an input sample every millisecond, generating a 64-bit random seed would take about an hour.
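For concreteness, here is a back-of-the-envelope check of those numbers in R; the 10 pF pin capacitance and 300 K temperature are assumptions, not measured values:

kB <- 1.380649e-23                     # Boltzmann constant, J/K
temp_K <- 300                          # assumed ambient temperature
C_pin <- 10e-12                        # assumed pin capacitance, ~10 pF
v_noise <- sqrt(kB * temp_K / C_pin)   # kT/C noise, ~20 uV rms
lsb <- 3.3 / 2^10                      # one step of a 10-bit, 3.3 V ADC, ~3.2 mV
v_noise / lsb                          # ~0.006 bits of entropy per analog sample
v_noise / 1                            # ~0.00002 bits per sample for a ~1 V digital threshold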
NIST Publication SP800-90B recommends min-entropy as the entropy measure. However, testing the min-entropy of an entropy source is non-trivial. See NIST SP800-90B for one way such testing can be done.
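As a rough illustration of min-entropy, here is a naive plug-in estimator in R; note that SP800-90B's own estimators are deliberately more conservative than this:

min_entropy <- function(x) {
  p <- as.vector(table(x)) / length(x)  # empirical probability of each symbol
  -log2(max(p))                         # min-entropy: -log2 of the most likely symbol
}
min_entropy(c(0, 0, 1, 2, 0, 1))        # most likely symbol has p = 1/2, so 1 bit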
Your question is off-topic if we are talking about "physics" entropy. But we could just as easily sample your analog signal, turn it into a digital waveform, and then discuss entropy in the information-theoretic context.
An easy and surprisingly accurate method for measuring entropy in a digital signal is to attempt to compress it using the best methods available. The higher the compression ratio, the smaller the information content.
If your goal is to generate random bits to produce a seed (as alluded to in one of the other answers), a useful approach is to compress randomness sampled from the environment (keyboard strokes, mouse movements, your analog system) using a common compression algorithm, and then discard the dictionary. What remains will have significant information content.
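As a rough sketch of the compression approach in R (a single general-purpose compressor only upper-bounds the entropy rate, so treat the result as an estimate, not a guarantee):

x_raw <- as.raw(sample(0:255, 1e5, replace = TRUE))  # stand-in for captured sample bytes
z <- memCompress(x_raw, type = "gzip")
8 * length(z) / length(x_raw)   # compressed bits per input byte; near 8 means incompressible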
I need to convert an RSSI value to a distance. There is a mathematical expression for this conversion:
d = 10^((TxPower - RSSI) / (10 * n)) (where the path-loss exponent n ranges from 2 to 4)
It is very difficult to choose the value of n, and it gives me inaccurate results.
Another option is a nonlinear regression model. Which one gives the best result? I need to feed the distance values into a trilateration technique.
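For reference, a minimal sketch of that log-distance path-loss model in R; the TxPower of -59 dBm (RSSI at 1 m) and n = 2 are illustrative assumptions that must be calibrated for your environment:

rssi_to_distance <- function(rssi, tx_power = -59, n = 2) {
  # d = 10^((TxPower - RSSI) / (10 * n)), the log-distance path-loss model
  10^((tx_power - rssi) / (10 * n))
}
rssi_to_distance(-75)   # ~6.3 m under the assumed parameters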
I would like to mention that the industry has generally concluded that RSSI is a poor estimator of distance. Factors like reflections, scattering of signals, and other physical properties of the environment make RSSI unsuitable for distance estimation. A quick search will turn up a number of articles and papers on this subject.
The RSSI value reported by most BLE devices does generally follow a "closer/stronger, farther/weaker" pattern, but using this data for something precise is not typically possible due to the nature of the BLE protocol (adaptive frequency hopping, short packets, etc.) and the behavior of 2.4 GHz radio signals in normal environments.
Sources:
Silabs BLE forum
Apple forums
Are there functions which produce "infinite" amounts of high entropy data? Moreover, do functions exist which produce the same random data (sequentially) time after time?
I kind of know that they exist, but do they have a specific name?
Use case examples:
Using the function to generate 100 bits of random data while maintaining high entropy. (Great!)
Using the same function to generate 10,000 bits of random data, where the first 100 bits generated are the same as the 100 bits generated before, while still maintaining high entropy.
Further, how would I go about building these functions myself?
You are most likely looking for Pseudo-Random Number Generators (PRNGs).
They are initialized by a seed, thus taking in a finite amount of entropy.
Good generators produce output with high apparent entropy, provided you judge it only from the output (i.e., you ignore the seed and the algorithm used to generate the numbers; if you know both, the entropy is obviously 0).
Most PRNG algorithms produce sequences that are uniformly distributed according to any of several statistical tests. Whether there is any way to distinguish the output of a high-quality PRNG from a truly random sequence without knowing the algorithm(s) used and the state with which it was initialized is an open question, central to the theory and practice of cryptography.
All PRNGs have a period, after which a generated sequence will restart.
The period of a PRNG is defined as the maximum, over all starting states, of the length of the repetition-free prefix of the sequence. The period is bounded by the number of possible states, usually measured in bits. Since the length of the period potentially doubles with each bit of state added, it is easy to build PRNGs with periods long enough for many practical applications.
Thus, to have two sequences of different lengths where one is the prefix of the other, you just have to run a PRNG with the same seed both times.
Building one yourself would be pretty tricky, but a rather good and widely used generator is the Mersenne Twister, which dates back only to 1998 and is defined in a paper by Matsumoto and Nishimura [1].
A trivial example would be a linear congruential generator.
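As a hedged sketch, here is a linear congruential generator in R (the multiplier and increment are the well-known Numerical Recipes constants), demonstrating that the same seed reproduces the same prefix:

lcg <- function(seed, n, a = 1664525, c = 1013904223, m = 2^32) {
  # state_{k+1} = (a * state_k + c) mod m; all values stay below 2^53, so doubles are exact
  out <- numeric(n)
  state <- seed
  for (i in 1:n) {
    state <- (a * state + c) %% m
    out[i] <- state
  }
  out
}
identical(lcg(42, 100), lcg(42, 10000)[1:100])  # TRUE: same seed, same prefix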
[1] Matsumoto, M.; Nishimura, T. (1998). "Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator". ACM Transactions on Modeling and Computer Simulation 8 (1): 3–30. doi:10.1145/272991.272995.
How can I find the total number of nodes in a distributed hash table (DHT) in an efficient way?
You generally do that by estimating from a small sample of the network, as enumerating all nodes of a large network is prohibitively expensive for most use-cases, and would still be inaccurate due to NAT anyway. So keep in mind that you are sampling only the reachable nodes.
Assuming that nodes are randomly distributed throughout the keyspace and you have some distance metric in your DHT (e.g. the XOR metric in Kademlia's case), you can take the median of the distances between adjacent nodes in a sample and then estimate the population as the keyspace size divided by the typical inter-node distance.
If you use the median you may have to compensate by some factor due to the skewness of the distribution, but my statistics are rusty; maybe someone else can chip in on that. A sketch is below.
The result will be very noisy, so you'll want to keep enough samples around for averaging, especially since the distribution is skewed and everything happens on an exponential scale (flip one bit toward the most significant end and the population estimate suddenly doubles or halves).
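Here is a toy simulation in R of that median-based estimate; the 32-bit keyspace is an illustrative assumption (real DHTs like Kademlia use 160 bits), and in practice the IDs would come from your own lookup samples rather than the whole population:

set.seed(1)
keyspace <- 2^32                         # assumed illustrative keyspace
n_true <- 50000                          # hidden network size we try to recover
ids <- sort(runif(n_true, 0, keyspace))  # nodes assumed uniform over the keyspace
gaps <- diff(ids)                        # distances between adjacent node IDs
# gaps are roughly exponential, so median = ln(2) * mean; ln(2) is the skew correction
log(2) * keyspace / median(gaps)         # estimate comes out close to n_true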
I would also suggest basing estimates only on outgoing queries that you control, not on incoming traffic, as incoming traffic may be biased by implementation details.
Another, crude way to get rough estimates is simply extrapolating from your routing table structure, assuming it scales with the network size.
Depending on your statistics prowess, you might want to do some of the following: read scientific papers describing the network, borrow code from existing implementations that already do estimation, or run simulations over broad ranges of population sizes - simply fitting a few million random node addresses into RAM and doing some calculations on them shouldn't be too difficult.
Maybe also talk to developers of existing implementations.
I am trying to assess how many audio drop-outs are in a given sound file of an ecological soundscape.
Format: WAV
Sampling rate (Hz): 192000
Channels (mono/stereo): stereo
PCM (integer format): TRUE
Bit depth (8/16/24/32/64): 16
My project used a two-element hydrophone. The elements were different brands/models, and we are trying to determine which element performed better in our specific experiment. One analysis we would like to conduct is measuring how often each element had drop-outs, or loss of signal. These drop-outs are not amplitude related; in other words, they are not caused by maxing out the amplitude. The element or the associated electronics simply failed.
I've been trying to do this in R, as that is the program I am most familiar with. I have very limited experience with Matlab and regex, but am open to trying those programs/languages. I'm a biologist, so please excuse any ignorance.
In R I've been playing around with the package 'seewave', and I've been able to produce some very pretty spectrograms (which, to be fair, is the only context in which I've previously used that package). I then attempted to use the envelope and automatic temporal measurements function within seewave (timer), and got some interesting, but opposite, results.
library(tuneR)    # readWave()
library(seewave)  # timer()
# read a 7-second window of the recording
foo <- readWave("Documents/DASBR/DASBR2_20131119$032011.wav", from = 53, to = 60, units = "seconds")
# measure the durations between amplitude events above the threshold
timer(foo, f = 96000, threshold = 6.5, msmooth = c(30, 5), colval = "blue")
I've altered the values of msmooth and threshold countless times, but that's just fine tinkering. What this function performs is measuring the duration between amplitude peaks above the given threshold. What I need it to do is either a) find samples in the signal without amplitude or b) measure the duration between areas without amplitude. I can work with either of those outputs. Basically, I want to reverse the direction the threshold is measuring: any sample that falls below the threshold should trigger a measurement, rather than any sample above it.
I'm still playing with seewave to see how to produce the data I need, but I'm looking for a bit of guidance. Perhaps there is a function in seewave that will accomplish what I'm trying to do more efficiently. Or, if there is any way to output the numerical data generated by timer, I could use the 'findValleys' function from the 'quantmod' package to get a list of all the data gaps.
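In case it helps, here is a sketch in R of the inverted-threshold idea using seewave's envelope function; the file path and the 6.5% threshold are placeholders to be tuned:

library(tuneR)
library(seewave)
foo <- readWave("Sound_file_Path")                 # placeholder path
amp <- env(foo, msmooth = c(30, 5), plot = FALSE)  # smoothed amplitude envelope
thr <- 0.065 * max(amp)                            # assumed threshold, ~6.5% of peak
runs <- rle(as.vector(amp) < thr)                  # runs of consecutive below-threshold points
dropout_lengths <- runs$lengths[runs$values]       # length of every below-threshold run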
So yeah, guidance is what I'm requesting, oh data crunching gods.
Cheers.
This problem sounds reminiscent of power transfer problems often seen in electrical engineering. One way to solve it is to take the RMS (root mean square) of the samples in the signal over time, averaged over short durations (perhaps a few seconds or even shorter). The durations where you see low RMS are where the dropouts are. It's analogous to the VU meters you sometimes see on audio amplifiers, which indicate the power being transferred from the amplifier to the speakers.
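A minimal sketch of that windowed-RMS idea in R (the one-second window is an assumption to be tuned):

window_rms <- function(x, win) {
  n <- floor(length(x) / win)
  sapply(seq_len(n), function(i) {
    seg <- x[((i - 1) * win + 1):(i * win)]   # i-th non-overlapping window
    sqrt(mean(seg^2))                         # root of the mean of the square
  })
}
# e.g. one-second windows at 192 kHz; low values mark candidate drop-outs
# window_rms(foo@left, 192000)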
I just wanted to summarize what I ended up doing so other people will be aware. Unfortunately, the RMS measurement is not what I was looking for. Though RMS could technically give me a basic idea of where drop-outs occur, because I'm working with ecological recordings there are too many other factors at play.
Background: The sound streams I am working with are from a two-element hydrophone, with the elements separated vertically by 2 meters and recording at about 100 m below sea level. We are finding that the element sitting at ~100 meters is experiencing heavy drop-outs, while the element at ~102 meters is mostly fine. We currently attribute this to a to-be-identified electrical issue. If both elements received audio in exactly the same way, RMS would work for detecting drop-outs, but because sound is received independently by each element, the RMS calculation is too heavily influenced by other factors. Two meters can make a larger difference than you'd think when it comes to source levels and signal reception; it's enough for us to localize vocalizing animals (with left/right ambiguity) based on the delay between signal arrivals.
All the same, here's what I did:
library(seewave)  # rms()
library(tuneR)    # readWave()
foo <- readWave("Sound_file_Path")
L <- foo@left    # left-channel samples
R <- foo@right   # right-channel samples
rms(L)
rms(R)
I then looped this process through a directory, which I detail here: for.loop with WAV files
So far, this issue is still unresolved, but thank you for the discussion!
~etg
I'm trying to determine the best DSP method for what I'm trying to accomplish, which is the following:
In real-time, detect the presence of a frequency from a set of different predefined frequencies (no more than 40 different frequencies all within a 1000Hz range). I need to be able to do this even when there are other frequencies (outside of this set or range) that are more dominant.
It is my understanding that FFT might not be the best method for this, because it tells you the most dominant frequency (magnitude) at any given time. This seems like it wouldn't work because if I'm trying to detect say a frequency at 1650Hz (which is present), but there's also a frequency at 500Hz which is stronger, then it's not going to tell me the current frequency is 1650Hz.
I've heard that the Goertzel algorithm might be better for what I'm trying to do, which is to detect single frequencies or a set of frequencies in real-time, even within sounds that have more dominant frequencies than the ones being detected.
Any guidance is greatly appreciated and please correct me if I'm wrong on these assumptions. Thanks!
In vague and somewhat inaccurate terms, the output of the FFT is the magnitude and phase of all[1] frequencies. That is, your statement, "[The FFT] tells you the most dominant frequency (magnitude) at any given time" is incorrect. The FFT is often used as a first step to determine the most dominant frequency, but that's not what it does. In fact, if you are interested in the most dominant frequency, you need to take extra steps over and beyond the FFT: you take the magnitude of all frequencies output by the FFT, and then find the maximum. The corresponding frequency is the dominant frequency.
For your application as I understand it, the FFT is the correct algorithm.
The Goertzel algorithm is closely related to the FFT. It allows for some optimization over the FFT if you are only interested in the magnitude and/or phase of a small subset of frequencies. It might be the right choice for your application depending on the number of frequencies in question, but only as an optimization -- other than performance, it won't solve any problems the FFT won't solve. Because there is more written about the FFT, I suggest you start there and use the Goertzel algorithm only if the FFT proves to not be fast enough and you can establish the Goertzel will be faster in your case.
[1] For practical purposes, what's most inaccurate about this statement is that the frequencies are grouped together in "bins". There's a limited resolution to the analysis which depends on a variety of factors.
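For what it's worth, here is a minimal sketch of the Goertzel algorithm in R; the 8 kHz sampling rate and the tone frequencies are illustrative assumptions:

goertzel <- function(x, f_target, fs) {
  k <- round(length(x) * f_target / fs)   # nearest DFT bin to the target frequency
  w <- 2 * pi * k / length(x)
  coeff <- 2 * cos(w)
  s1 <- 0; s2 <- 0
  for (xn in x) {                         # second-order recursion over the samples
    s0 <- xn + coeff * s1 - s2
    s2 <- s1
    s1 <- s0
  }
  s1^2 + s2^2 - coeff * s1 * s2           # power at the target frequency
}
fs <- 8000
t <- seq(0, 0.1, by = 1 / fs)
x <- sin(2 * pi * 1650 * t) + 3 * sin(2 * pi * 500 * t)  # weak 1650 Hz tone under a strong 500 Hz tone
goertzel(x, 1650, fs) > goertzel(x, 1000, fs)            # TRUE: the weak tone is still detectable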
I am leaving my other answer as-is because I think it stands on its own.
Based on your comments and private email, the problem you are facing is most likely this: sounds like speech that are principally in one frequency range have harmonics that stretch into higher frequency ranges. This problem is exacerbated by low-quality microphones and electronics, but it is not caused by them and wouldn't go away even with perfect equipment. Once your signal is cluttered with noise in the same band, you can't distinguish "on" from "off" in a simple and reliable way, because an "on" could be caused by the noise. You could try some adaptive thresholding based on noise in other bands, and you'll probably get somewhere, but that's no way to build a robust system.
There are a number of ways to solve this problem, but they all involve modulating your signal and using error detection and correction. Basically, you are building a modem and/or radio. Ultimately, what I'm saying is this: you can't solve your problem on the detector alone. You need to build some redundancy into your signal, and you may need to think about other methods of detection. I know of three methods of sending complex signals:
Amplitude modulation, which is what it sounds like you are doing now.
Frequency modulation, which tends to be more robust in the face of ambient noise. (compare FM and AM radio)
Phase modulation, which is more subtle and tricky.
These methods can be combined and multiplexed in various ways; read about them on Wikipedia. Moreover, once your base signal is transmitted, you can add error detection and correction on top.
I am not an expert in this area, but off the top of my head, I am not sure you'll be able to use PM silently, and AM is simply too sensitive to noise, as you've discovered, although it might work with the right kind of redundancy. FM is probably your best bet.