Shannon Information weighted by probability of different types - information-theory

Suppose I have n independent types in a system, each existing with probability t_i, i = 1, ..., n (so the t_i sum to 1). Suppose also that I can calculate the Shannon entropy for each type; call this value S_i.
1) Does it make sense to then calculate a weighted sum such as H = -sum_{i=1}^{n} t_i * S_i?
2) How could I compare the H values of two systems with different numbers of types (e.g., system 1 has n = 2 types and system 2 has n = 4 types)?
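As a minimal sketch with made-up numbers, the quantity in question 1 could be computed in R as below (note that each S_i is itself an entropy and therefore non-negative, so the leading minus sign in the question would simply make H non-positive):
t <- c(0.5, 0.3, 0.2)   # made-up type probabilities (sum to 1)
S <- c(1.2, 0.8, 2.1)   # made-up Shannon entropy of each type
H <- sum(t * S)         # probability-weighted average of the per-type entropies
H                       # with the question's minus sign this would just be -H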

Related

r - Estimate selection-unbiased allele frequencies with linear regression systems

I have a few data sets consisting of frequencies of i distinct alleles/SNPs in some populations. Additionally, I recorded some factors suspected of having changed the frequencies of these alleles within the populations in the past through their selective effect. It is assumed that the selection impact can be described by a simple linear regression for every selection factor.
Now I'd like to estimate what the allele frequencies would be expected to be under identical selective forces (thus, I set selection = 1). These new allele frequencies a'_i are derived as
a'_i = a_i - function[a_i|selection=1]
where a_i is the current frequency of allele i in a population and function[a_i|selection=1] is the estimated allele frequency in the absence of selective forces.
However, there are some constraints for the whole process:
The minimal value allowed for a'_i is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, all my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may (but need not) have an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!
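To make the setup concrete, here is a rough R sketch with entirely made-up data, reusing the p, q, r and x, y, z names from the ABO example: each allele frequency is regressed on the selection variables, predicted with every selection variable set to 1, and the two constraints are then crudely enforced by clipping negative values and renormalising each population. This only illustrates the mechanics of the naive approach described in the question, not a principled solution:
# made-up data: one row per population, allele frequencies p, q, r and selection variables x, y, z
set.seed(1)
n_pop <- 30
dat <- data.frame(x = rnorm(n_pop), y = rnorm(n_pop), z = rnorm(n_pop))
dat$p <- runif(n_pop, 0.2, 0.4)
dat$q <- runif(n_pop, 0.1, 0.3)
dat$r <- 1 - dat$p - dat$q

alleles <- c("p", "q", "r")
pred1 <- sapply(alleles, function(a) {
  fit <- lm(reformulate(c("x", "y", "z"), response = a), data = dat)
  predict(fit, newdata = transform(dat, x = 1, y = 1, z = 1))   # every selection variable set to 1
})
pred1[pred1 < 0] <- 0            # constraint 1: no negative frequencies
pred1 <- pred1 / rowSums(pred1)  # constraint 2: frequencies sum to 1 within each population
head(pred1)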

Calculating mutual information between two classifiers for a fixed dataset

After searching the web, I have not found an answer to my question. There are formulas that, for example, can calculate the entropy of two classifiers.
How can I calculate the mutual information?
Consider two random variables X and Y. The mutual information between these two variables is defined as I(X,Y) = H(X) + H(Y) − H(X,Y), where H(X) is the entropy of variable X, H(X,Y) is the joint entropy of both variables, and H(Y|X) = H(X,Y) − H(X) is the conditional entropy, which measures the uncertainty of variable Y given the value of variable X. The mutual information between the two variables measures the amount of uncertainty reduction for one variable given the value of the other.
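For two classifiers on a fixed dataset, this can be estimated from the empirical joint distribution of their predicted labels. A minimal R sketch, with made-up label vectors labels1 and labels2 standing in for the two classifiers' outputs:
labels1 <- c("a", "a", "b", "b", "a", "b", "a", "b")   # predictions from classifier 1 (made up)
labels2 <- c("a", "b", "b", "b", "a", "b", "a", "a")   # predictions from classifier 2 (made up)

entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }  # entropy of a probability vector, in bits

p_xy <- table(labels1, labels2) / length(labels1)   # empirical joint distribution
p_x  <- rowSums(p_xy)                               # marginal distribution of X
p_y  <- colSums(p_xy)                               # marginal distribution of Y

MI <- entropy(p_x) + entropy(p_y) - entropy(p_xy)   # I(X,Y) = H(X) + H(Y) - H(X,Y)
MI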

Contradiction between Pearson and Pairwise.prop.test

I have two vectors a and b of the same length. The vectors contain the number of times each game has been played. For example, game 1 has been played 265350 times in group a and 52516 times in group b.
a <- c(265350, 89148, 243182, 208991, 113090, 124698, 146574, 33649, 276435, 9320, 58630, 20139, 26178, 7837, 6405, 399)
b <- c(52516, 42840, 60571, 58355, 46975, 47262, 58197, 42074, 50090, 27198, 45491, 43048, 44512, 27266, 43519, 28766)
I want to use Pearson's chi-squared test to test independence between the two vectors. In R I type
chisq.test(a,b)
and I get a p-value of 0.2348, meaning that the two vectors are independent (the null hypothesis is not rejected).
But when I run pairwise.prop.test(a,b), almost all of the pairwise p-values are very low, which suggests pairwise dependence between the two vectors, in contrast to the first result. How can that be?
The pairwise.prop.test is not the correct test for your case.
As the documentation says:
Calculate pairwise comparisons between pairs of proportions with correction for multiple testing
And also, for the two arguments:
x (first argument): Vector of counts of successes or a matrix with 2 columns giving the counts of successes and failures, respectively.
n (second argument): Vector of counts of trials; ignored if x is a matrix.
So x is the number of successes out of n trials, i.e. each element of x is less than or equal to the corresponding element of n. This is why pairwise.prop.test is used for proportions. As an example, imagine tossing a coin 1000 times and getting heads 550 times: x would be 550 and n would be 1000. In your case you do not have anything like that; you just have counts of a game in two groups.
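To illustrate the successes/trials layout with made-up numbers, comparing two such coins would look like this:
x <- c(550, 480)          # number of heads (successes) for each coin
n <- c(1000, 1000)        # number of tosses (trials) for each coin
pairwise.prop.test(x, n)  # compares the two proportions; note x <= n element-wise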
The correct hypothesis test for independence is the chisq.test(a, b) that you have already used, and I would trust that result.

Discrete Math: Given a set of integers, permute, calculate the expected number of integers that remain in the same position

So we are given a set of integers from 0 to n. This is then randomized. The goal is to calculate the expected number of integers which remain in the same position in both lists. I have tried to set up two indicator variables for each integer and then map them to the two different sets, but I don't really know where to go from there.
The random variable X, representing the number of your integers which remain in the same position after randomisation, is a sum of n+1 indicator variables, each equal to 1 with probability 1/(n+1); therefore the expected number of integers remaining in place is 1.
My reasoning is:
Each integer can move to any position in the list after randomisation, with equal probability. So whether a given integer remains in place is a Bernoulli variable with probability 1/(n+1), since there are n+1 possible positions it could move to and only 1 position for it to have remained in place.
There are therefore n+1 such Bernoulli indicators, all with the same probability. (A Bernoulli variable represents a yes/no outcome where the yes has a given probability.) Note that they are not independent: if n of the integers stay in place, the last one must as well, so their sum is not exactly a binomial distribution.
However, the expected value of a sum is the sum of the expected values, and this linearity of expectation holds whether or not the terms are independent.
The expected number of your integers which remain in place after randomisation is therefore (n+1) × (1/(n+1)), which is 1.
For more info on the binomial distribution and on linearity of expectation, see Wikipedia.
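A quick R simulation (with an arbitrary n) agrees with this:
set.seed(42)
n <- 10                              # integers 0..n, i.e. n + 1 positions
fixed_counts <- replicate(1e5, {
  perm <- sample(0:n)                # a uniformly random permutation of 0..n
  sum(perm == 0:n)                   # how many integers stayed in place
})
mean(fixed_counts)                   # close to 1, regardless of n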

substitution matrix based on spatial autocorrelation transformation

I would like to measure Hamming sequence similarity in which the substitution costs are based not on the substitution rates in the observed sequences but on the spatial autocorrelation of the different states within the study area (the states are thus not related to DNA but something else).
I divided my study area into grid cells of equal size (e.g. 1000 m) and measured how often the same "state" is observed in a neighbouring cell (rook case). Consequently, the weight matrix indicates that staying within the same state (A to A) has a much higher probability than moving from A to B, B to C, or A to C. This already indicates that the states have high spatial autocorrelation.
The problem is that for measuring sequence similarity the substitution matrix should have zeros on the diagonal. Therefore I was wondering whether there is some kind of transformation to go from an "autocorrelation matrix" to a substitution matrix with zero values along the diagonal. By means of this we would like to account for spatial autocorrelation in the study area in our sequence similarity measure. To do my analysis I am using the package TraMineR.
Example matrix in R for sequences consisting of four states (A, B, C, D):
Sequence example: AAAAAABBBBCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDAAAAAAAAA
Autocorrelation matrix:
A = c(17.50,3.00,1.00,0.05)
B = c(3.00,10.00,2.00,1.00)
C = c(1.00,2.00,30.00,3.00)
D = c(0.05,1.00,3.00,20.00)
subm = rbind(A,B,C,D)
colnames(subm) = c("A","B","C","D")
How can I transform this matrix into a substitution matrix?
First, TraMineR computes the Hamming distance, i.e., a dissimilarity, not a similarity.
The simple Hamming distance is just the count of mismatches between two sequences. For example, the Hamming distance between AABBCC and ABBBAC is 2, and between AAAAAA and AAAAAA it is 0 since there are no mismatches.
Generalized Hamming allows weighting the mismatches (not the matches!) with substitution costs. For example, if the substitution cost between A and B is 1.5, and is 2 between A and C, then the distance between the first two sequences above would be the weighted sum of mismatches, i.e., 3.5. It would still be zero between a sequence and itself.
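A quick check of that arithmetic in plain R (the cost values are the made-up ones from the example above):
s1 <- strsplit("AABBCC", "")[[1]]
s2 <- strsplit("ABBBAC", "")[[1]]
cost <- matrix(0, 3, 3, dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
cost["A", "B"] <- cost["B", "A"] <- 1.5   # substitution cost between A and B
cost["A", "C"] <- cost["C", "A"] <- 2     # substitution cost between A and C
sum(cost[cbind(s1, s2)])                  # mismatches at positions 2 and 5: 1.5 + 2 = 3.5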
From what I understand, the matrix shown is not the matrix of substitution costs. It is the matrix of what you call 'spatial autocorrelations', and you are looking for a way to turn this information into substitution costs.
The idea is to assign a high substitution cost (mismatch weight) when the autocorrelation (a rate in your case) is low, i.e., when there is a low probability of finding, say, state B in the neighbourhood of state A, and to assign a low substitution cost when the probability is high. Since your probability matrix is symmetric, a simple solution is to use $1 - p(A|B)$ for all off-diagonal terms and to leave 0 on the diagonal, for the reason explained above.
sm <- 1 - subm/100   # rescale the rates to [0, 1] and invert them: high autocorrelation -> low cost
diag(sm) <- 0        # matches must keep a cost of zero
sm
For non-symmetric probabilities, you could use a formula similar to the one used for deriving costs from transition rates, i.e., $2 - p(A|B) - p(B|A)$.
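As a sketch, assuming P is a (possibly non-symmetric) matrix of neighbourhood probabilities with the same row/column layout as subm:
P <- subm / 100        # placeholder here; replace with your own (non-symmetric) probabilities
sm2 <- 2 - P - t(P)    # cost = 2 - p(A|B) - p(B|A)
diag(sm2) <- 0         # zero cost for matches
sm2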
