Discrepancy Between Two Methods of Finding Information Entropy - information-theory

So I learned about the concept of information entropy from Khan Academy, where it was phrased as the "average number of yes-or-no questions needed per symbol". They also gave an alternative form using logarithms.
So let's say we have a symbol generator that produces A,B, and C.
P(A)=1/2, P(B)=1/3, and P(C)=1/6
According to their method, I would get a chart like this:
[Chart for the first method: a decision tree asking "Is it A?" and then "Is it B?", so A needs 1 question while B and C each need 2.]
Then I would multiply each symbol's probability of occurring by the number of questions it needs, giving
(1/2)*1 + (1/3)*2 + (1/6)*2 = 1.5 bits
but their other method gives
-(1/2)log2(1/2) - (1/3)log2(1/3) - (1/6)log2(1/6) = 1.459... bits
The difference is small, but still significant. I've tried this with different combinations and probabilities and got similar results. Is there something I'm missing? Am I using either method wrong, or is one of them more conditional?

Your second calculation is correct.
The problem with your decision tree approach is that the decision tree is not optimal (and indeed, no binary decision tree could be for those probabilities). Your “is it B” decision node represents less than one bit of information, since once you get there you already know it’s probably B. So your decision tree represents a potential encoding of symbols which is expected to consume 1.5 bits on average, but it represents slightly less than 1.5 bits of information.
In order to have a binary tree which represents an optimal encoding, each node needs to have balanced probabilities. This is not possible if some symbol has a probability whose denominator is not a power of 2.
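For anyone who wants to check the arithmetic, here is a small R sketch (mine, not from the original answer) computing both quantities for these probabilities:

    # Expected number of yes/no questions under the tree "Is it A?" then "Is it B?"
    p <- c(A = 1/2, B = 1/3, C = 1/6)
    questions <- c(A = 1, B = 2, C = 2)
    sum(p * questions)        # 1.5 bits per symbol for this particular code

    # Shannon entropy: the lower bound over all possible question schemes
    -sum(p * log2(p))         # about 1.459 bits per symbol

The gap between the two numbers is exactly the inefficiency of the decision tree described above.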

Related

pop.size argument in GenMatch() and genoud()

I am using genetic matching in R using GenMatch in order to find comparable treatment and control groups to estimate a treatment effect. The default code for matching looks as follows:
GenMatch(Tr, X, BalanceMatrix=X, estimand="ATT", M=1, weights=NULL,
pop.size = 100, max.generations=100,...)
The description for the pop.size argument in the package is:
Population Size. This is the number of individuals genoud uses to
solve the optimization problem. The theorems proving that genetic
algorithms find good solutions are asymptotic in population size.
Therefore, it is important that this value not be small. See genoud
for more details.
Looking at genoud, the additional description is:
...There are several restrictions on what the value of this number can
be. No matter what population size the user requests, the number is
automatically adjusted to make certain that the relevant restrictions
are satisfied. These restrictions originate in what is required by
several of the operators. In particular, operators 6 (Simple
Crossover) and 8 (Heuristic Crossover) require an even number of
individuals to work on—i.e., they require two parents. Therefore, the
pop.size variable and the operators sets must be such that these three
operators have an even number of individuals to work with. If this
does not occur, the population size is automatically increased until
this constraint is satisfied.
I want to know how genoud (resp. GenMatch) incorporates the population size argument. Does the algorithm randomly select n individuals from the population for the optimization?
I had a look at the package description and the source code, but did not find a clear answer.
The word "individuals" here does not refer to individuals in the sample (i.e., individual units your dataset), but rather to virtual individuals that the genetic algorithm uses. These individuals are individual draws of a set of the variables to be optimized. They are unrelated to your sample.
The goal of genetic matching is to choose a set of scaling factors (which the Matching documentation calls weights), one for each covariate, that weight the importance of that covariate in a scaled Euclidean distance match. I'm no expert on the genetic algorithm, but my understanding of what it does is that it makes a bunch of guesses at the optimal values of these scaling factors, keeps the ones that "do the best" in the sense of optimizing the criterion (which is determined by fit.func in GenMatch()), and creates new guesses as slight perturbations of the kept guesses. It then repeats this process many times, simulating what natural selection does to optimize traits in living things. Each guess is what the word "individual" refers to in the description for pop.size, which corresponds to the number of guesses at each generation of the algorithm.
GenMatch() always uses your entire sample (unless you have provided a restriction like a caliper, exact matching requirement, or common support rule); it does not sample units from your sample to form each guess (which is what bagging does in other machine-learning contexts).
Results will change over many runs because the genetic algorithm itself is a stochastic process. It may converge to a solution asymptotically, but because it is optimizing over a lumpy surface, it will find different solutions each time in finite samples with finite generations and a finite population size (i.e., pop.size).
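For concreteness, here is a hedged sketch of a typical GenMatch()/Match() workflow; the lalonde data and the particular covariates are illustrative assumptions, not taken from the question:

    library(Matching)
    data(lalonde)

    X  <- cbind(lalonde$age, lalonde$educ, lalonde$re74, lalonde$re75)
    Tr <- lalonde$treat

    # pop.size controls how many candidate weight vectors ("individuals") the
    # genetic algorithm evaluates in each generation; larger is slower but
    # tends to find better covariate balance.
    gen <- GenMatch(Tr = Tr, X = X, BalanceMatrix = X, estimand = "ATT",
                    M = 1, pop.size = 200, max.generations = 50)

    # The matching itself always uses the full sample; GenMatch only chose
    # the covariate weights.
    m <- Match(Y = lalonde$re78, Tr = Tr, X = X, estimand = "ATT",
               Weight.matrix = gen)
    summary(m)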

Cluster your time-series data

I have time-series data for 12 consumers, named a through l (the data itself is not reproduced here).
I want to cluster these consumers so that I can see which of them have the most similar consumption behavior. I found the clustering method pamk, which automatically chooses the number of clusters for the input data.
I assume I have only two options for calculating the distance between any two time series: Euclidean and DTW. I tried both and got different clusters. Which one should I rely on, and why?
Using Euclidean distance I get one set of clusters, and using DTW distance I get a different set (the cluster plots are not reproduced here).
Conclusion:
How would you decide which clustering approach is best in this case?
Note: I have asked the same question on Cross-Validated also.
None of the time series above look similar to me. Do you see any pattern? Maybe there is no pattern?
The clustering visualizations also indicate that there are no clusters. b and l appear to be the most unusual outliers, followed by d, e, and h; but there are no clusters there.
Also try hierarchical clustering. The dendrogram may be more understandable.
But either way, there may be no clusters. You need to be prepared for this outcome and consider it a valid hypothesis. Double-check any result. As you have seen, pam will always return a result, and you have absolutely no means of deciding which result is more "correct" than the other (most likely, neither is correct, and you should rely on neither, to answer your question).
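As a starting point for the hierarchical-clustering suggestion above, here is a hedged sketch; it assumes a numeric matrix named consumption with one row per consumer (a through l) and relies on the dtw package registering a "DTW" method with the proxy distance framework:

    library(dtw)      # registers a "DTW" distance with the proxy package
    library(proxy)

    # consumption: numeric matrix, one row per consumer a..l (assumption)
    d_euc <- dist(consumption, method = "Euclidean")
    d_dtw <- dist(consumption, method = "DTW")

    # Dendrograms are often easier to read than a forced flat partition.
    plot(hclust(as.dist(d_euc)), main = "Euclidean distance")
    plot(hclust(as.dist(d_dtw)), main = "DTW distance")

If both dendrograms show only long, chain-like merges with no clear branches, that supports the "there may be no clusters" conclusion.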

Interpreting negative LDA classifier scores

The post
Classification functions in linear discriminant analysis in R
from user Tyler provides a function to produce the classification functions (not discriminant functions!) from an LDA model generated with lda().
I used these classification functions to calculate all classification scores for my data. I want to use this additional information, e.g., to find out which was the second most probable class and to understand the development across different time slices.
Now I would like to ask for your help in interpreting the following scenarios:
Scores close to or exactly zero (is it possible to claim that this class was effectively not recognized?)
A single negative score whose absolute value is higher than the highest positive value (does it mean anything at all?)
Results where all scores are negative (in the original interpretation, the highest score determines the classification; is this intended by LDA, or does it mean that none of the classes is a good fit and no known pattern could be identified?)
A single very low positive value while all the others are negative with large absolute values (can I argue that the "signal strength" is low in this case?)
I know this is more of a statistical than a programming problem; I thought of it as a follow-up to the post mentioned at the beginning of this entry.
Thank you very much for your help!
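Not part of the original question, but one way to make such scores easier to interpret: if the classification functions include the log-prior term, the scores are log-posteriors up to an additive constant per observation, so a softmax turns each row of scores into class probabilities. A hedged sketch, assuming scores is an n-by-K matrix with one (named) column per class:

    # scores: n x K matrix of classification-function scores (assumption)
    softmax <- function(s) {
      e <- exp(s - max(s))   # subtract the max for numerical stability
      e / sum(e)
    }

    posteriors <- t(apply(scores, 1, softmax))

    # Most probable and second most probable class for each observation
    ord <- t(apply(posteriors, 1, order, decreasing = TRUE))
    first_class  <- colnames(scores)[ord[, 1]]
    second_class <- colnames(scores)[ord[, 2]]

On this scale, "all scores negative" or "scores near zero" stops mattering; what matters is how flat or peaked the posterior row is.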

What is the meaning of "Inf" in S_Dbw output in R commander?

I have run the clv package, which provides the S_Dbw and SD validity indexes for clustering, in R Commander. (http://cran.r-project.org/web/packages/clv/index.html)
I evaluated my clustering results from the DBSCAN, k-means, and Kohonen algorithms with the S_Dbw index, but for all three algorithms S_Dbw is "Inf".
Does it mean "infinite"? Why did I get "Inf"? Is there a problem with my clustering results?
In general, when is the S_Dbw index "Inf"?
Be careful when comparing different algorithms with such an index.
The reason is that the index is pretty much an algorithm in itself. One particular clustering will necessarily be the "best" for each index. The main difference between an index and an actual clustering algorithm is that the index doesn't tell you how to find the "best" solution.
Some examples: k-means minimizes the distances from cluster members to cluster centers. Single-link hierarchical clustering finds the partition with the optimal minimum distance between clusters. And DBSCAN finds the partitioning of the dataset in which all density-connected points are in the same partition. As such, DBSCAN is optimal - if you use the appropriate measure.
Seriously: do not assume that because one algorithm scores higher than another on a particular measure, it works better. All you find out this way is that a particular algorithm is more (cor-)related to a particular measure. Think of it as a kind of correlation between the measure and the algorithm, on a conceptual level.
Using a measure to compare different results of the same algorithm is different. Then, obviously, there shouldn't be a benefit of an algorithm over itself. There might still be a similar effect with respect to parameters: for example, the in-cluster distances in k-means obviously go down when you increase k.
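A tiny illustration (toy data, my own) of that last point: the k-means criterion itself keeps improving as k grows, so it cannot arbitrate between different values of k.

    set.seed(1)
    x <- matrix(rnorm(200 * 2), ncol = 2)   # data with no real cluster structure

    sapply(1:8, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
    # The total within-cluster sum of squares essentially decreases monotonically in k.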
In fact, many of the measures are not even well-defined on DBSCAN results, because DBSCAN has the concept of noise points, which the indexes do not (as far as I know).
Do not assume that the measure will give you an indication of what is "true" or "correct", and even less of what is useful or new. You should be using cluster analysis not to find a mathematical optimum of a particular measure, but to learn something new and useful about your data - which probably is not some measure number.
Back to the indices. They are usually designed entirely around k-means. From a short look at S_Dbw, I have the impression that the moment one "cluster" consists of a single object (e.g. a noise object in DBSCAN), the value becomes infinity - i.e., undefined. The authors of that index seem not to have considered this corner case, and only used it on toy data sets where such situations did not arise. The R implementation can't fix this without deviating from the original index and turning it into yet another index. Handling noise objects and singletons is far from trivial. I have not yet seen an index that doesn't fail in one way or another - typically, a solution such as "all objects are noise" will either score perfectly, or every clustering can trivially be improved by putting each noise object into the nearest non-singleton cluster. If you want your algorithm to be able to say "this object doesn't belong to any cluster", then I do not know of any appropriate index.
The IEEE floating point standard defines Inf and -Inf as positive and negative infinity respectively. It means your result was too large to represent in the given number of bits.
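A toy illustration (mine, not the actual clv code) of how a singleton "cluster", such as a DBSCAN noise point kept as its own cluster, can push such an index to Inf: its spread is zero, and any index term that divides by that spread becomes infinite in R.

    spread <- c(1.3, 0.8, 0)   # per-cluster scatter; the third "cluster" is a single point
    2.5 / spread               # 1.92  3.12  Inf  -> the combined index becomes Inf

    # A quick sanity check before computing S_Dbw: look for tiny clusters,
    # e.g. table(cluster_labels), and decide how to treat noise points.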

Where are "Special Numbers" mentioned in Concrete Maths used?

I was glancing through the contents of Concrete Mathematics online. I had at least heard of most of the functions and tricks mentioned, but there is a whole section on Special Numbers. These include Stirling numbers, Eulerian numbers, Harmonic numbers, and so on. I have never encountered any of these numbers before. How do they help in computational problems? Where are they generally used?
Harmonic Numbers appear almost everywhere! Musical Harmonies, analysis of Quicksort...
Stirling Numbers (first and second kind) arise in a variety of combinatorics and partitioning problems.
Eulerian Numbers also occur in several places, most notably in permutations and in the coefficients of polylogarithm functions.
A lot of the numbers you mentioned are used in the analysis of algorithms. You may not have these numbers in your code, but you'll need them if you want to estimate how long it will take for your code to run. You might see them in your code too. Some of these numbers are related to combinatorics, counting how many ways something can happen.
Sometimes it's not enough to know how many possibilities there are because you need to enumerate over the possibilities. Volume 4 of Knuth's TAOCP, in progress, gives the algorithms you need.
Here's an example of using Fibonacci numbers as part of a numerical integration problem.
Harmonic numbers are a discrete analog of logarithms and so they come up in difference equations just like logs come up in differential equations. Here's an example of physical applications of harmonic means, related to harmonic numbers. See the book Gamma for many examples of harmonic numbers in action, especially the chapter "It's a harmonic world."
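A small sketch (mine) showing the sense in which harmonic numbers are a discrete analog of the logarithm, which is exactly why they turn up in average-case analyses such as Quicksort's:

    # H_n = 1 + 1/2 + ... + 1/n grows like ln(n) plus a constant.
    H <- function(n) sum(1 / seq_len(n))

    n <- 10^(1:5)
    cbind(n,
          H_n    = sapply(n, H),
          approx = log(n) + 0.5772156649)   # ln(n) + Euler-Mascheroni constant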
These special numbers can help out in computational problems in many ways. For example:
You want to find out when your program to compute the GCD of 2 numbers is going to take the longest amount of time: Try 2 consecutive Fibonacci Numbers.
You want to have a rough estimate of the factorial of a large number, but your factorial program is taking too long: Use Stirling's Approximation.
You're testing for prime numbers, but for some numbers you always get the wrong answer: it could be that you're using Fermat's primality test, in which case the Carmichael numbers are your culprits.
The most common general case I can think of is in looping. Most of the time you specify a loop using a (start;stop;step) type of syntax, in which case it may be possible to reduce the execution time by using properties of the numbers involved.
For example, summing up all the numbers from 1 to n when n is large in a loop is definitely slower than using the identity sum = n*(n + 1)/2.
There are a large number of examples like these. Many of them are in cryptography, where the security of information systems sometimes depends on tricks like these. They can also help with performance and memory issues, because when you know the formula, you may find a faster or more efficient way to compute the things you actually care about.
For more information, check out wikipedia, or simply try out Project Euler. You'll start finding patterns pretty fast.
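A quick sketch (toy code, my own) of two of the points above: consecutive Fibonacci numbers are the worst case for Euclid's GCD algorithm, and the closed form n*(n + 1)/2 replaces a summation loop.

    # Count the steps taken by Euclid's algorithm.
    gcd_steps <- function(a, b) {
      steps <- 0
      while (b != 0) {
        tmp <- b; b <- a %% b; a <- tmp
        steps <- steps + 1
      }
      steps
    }

    fib <- c(1, 1)
    for (i in 3:20) fib[i] <- fib[i - 1] + fib[i - 2]

    gcd_steps(fib[20], fib[19])   # consecutive Fibonacci numbers: 18 steps, the worst case
    gcd_steps(6765, 4000)         # a similarly sized non-Fibonacci pair: far fewer steps

    # Closed form versus looping:
    n <- 1e6
    sum(seq_len(n))       # O(n) work
    n * (n + 1) / 2       # O(1)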
Most of these numbers count certain kinds of discrete structures (for instance, Stirling numbers count set partitions and permutation cycles). Such structures, and hence these sequences, implicitly arise in the analysis of algorithms.
There is an extensive list at OEIS that lists almost all sequences that appear in Concrete Math. A short summary from that list:
Golomb's Sequence
Binomial Coefficients
Rencontres Numbers
Stirling Numbers
Eulerian Numbers
Hyperfactorials
Genocchi Numbers
You can browse the OEIS pages for the respective sequences to get detailed information about the "properties" of these sequences (though not exactly applications, if that's what you're most interested in).
Also, if you want to see real-life uses of these sequences in analysis of algorithms, flip through the index of Knuth's Art of Computer Programming, and you'll find many references to "applications" of these sequences. John D. Cook already mentioned applications of Fibonacci & Harmonic numbers; here are some more examples:
Stirling Cycle Numbers arise in the analysis of the standard algorithm that finds the maximum element of an array (TAOCP Sec. 1.2.10): how many times must the current maximum value be updated while finding the maximum? It turns out that the probability that the maximum needs to be updated k times in an array of n elements is p[n][k] = StirlingCycle[n, k+1]/n!. From this, one can derive that, on average, approximately ln(n) updates (exactly H_n - 1) will be necessary.
Genocchi Numbers arise in connection with counting the number of BDDs that are "thin" (TAOCP 7.1.4 Exercise 174).
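A small simulation (mine, not from TAOCP) that matches the maximum-finding analysis above: the average number of updates to the running maximum over a random array of n elements comes out at H_n - 1, roughly ln(n).

    set.seed(42)
    n <- 1000

    count_updates <- function(x) {
      best <- x[1]; updates <- 0
      for (v in x[-1]) if (v > best) { best <- v; updates <- updates + 1 }
      updates
    }

    mean(replicate(2000, count_updates(sample(n))))   # simulated average
    sum(1 / seq_len(n)) - 1                           # H_n - 1, about 6.49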
Not necessarily a magic number from the reference you mentioned, but nonetheless --
0x5f3759df
-- the notorious magic number used to calculate the inverse square root of a number by giving a good first estimate for Newton's method of approximating roots, often attributed to the work of John Carmack - more info here.
Not programming related, huh? :)
Is this directly programming related? Surely related, but I don't know how closely.
Special numbers, such as e, pi, etc., come up all over the place. I don't think anyone would argue about those two. The golden ratio also appears with amazing frequency, in everything from art to other special numbers themselves (look at the ratio between successive Fibonacci numbers).
Various sequences and families of numbers also appear in many places in mathematics and therefore in programming too. A beautiful place to look is the On-Line Encyclopedia of Integer Sequences.
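A quick snippet (mine) showing the golden ratio emerging from ratios of successive Fibonacci numbers:

    fib <- c(1, 1)
    for (i in 3:25) fib[i] <- fib[i - 1] + fib[i - 2]

    tail(fib / c(NA, head(fib, -1)))   # ratios F(n)/F(n-1) settle near 1.618034
    (1 + sqrt(5)) / 2                  # the golden ratio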
I'll suggest this is an experience thing. For example, when I took linear algebra many, many years ago, I learned about the eigenvalues and eigenvectors of a matrix. I'll admit that I did not at all appreciate their significance until I saw them in use in a variety of places: in statistics, where they tell you about the uncertainty of an estimate from a covariance matrix, the size and shape of a confidence ellipse, principal component analysis, and the long-term state of a Markov process; in numerical methods, where they tell you about the convergence of a method, be it in optimization or an ODE solver; and in mechanical engineering, where you see them as principal stresses and strains.
