Discrete Math: Given a set of integers, permute, calculate expected number of integers that remain in the same position

So we are given the set of integers from 0 to n, which is then randomly permuted. The goal is to calculate the expected number of integers which remain in the same position in both lists. I have tried to set up two indicator variables for each integer and then map them to the two different lists, but I don't really know how to go from there.

The random variable X, representing the number of your integers which remain in the same position after randomisation, has expected value 1.
My reasoning is:
Each integer is equally likely to end up in any of the n+1 positions after randomisation. So whether a given integer remains in place is a Bernoulli (indicator) variable with probability 1/(n+1), since there are n+1 possible positions it could occupy and only 1 of them leaves it in place. This is exactly the indicator-variable setup you describe.
There are therefore n+1 such indicator variables, all with the same probability. They are not, however, independent: for example, if n of the integers are known to remain in place, the remaining one is forced to remain in place too. So X is not binomially distributed.
That does not matter for the expectation, because linearity of expectation holds regardless of independence. Writing X = I_0 + I_1 + ... + I_n, where I_j indicates that integer j remains in place, the expected value is E[X] = E[I_0] + E[I_1] + ... + E[I_n] = (n+1) * 1/(n+1) = 1.
For more info on linearity of expectation, see Wikipedia.
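As a quick sanity check, here is a small simulation (my own sketch, not part of the answer above; the choice n = 9, i.e. the integers 0 to 9, is arbitrary):
set.seed(1)                                       # arbitrary seed, for reproducibility only
original <- 0:9                                   # the integers 0..n with n = 9
fixed <- replicate(100000, sum(sample(original) == original))
mean(fixed)                                       # should come out close to 1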

Related

How do I generate numbers for discrete uniform distribution in R

How do I generate numbers for a discrete uniform distribution in R? I need n = 100 on the interval [1, 10]. Put simply, I need the numbers 1 through 10, but the frequency of each number must be 10. NOTE: the numbers must be integers.
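One way to do this, assuming the fixed-frequency requirement (each of 1 to 10 appearing exactly 10 times) is the point rather than independent random draws, is a sketch like:
x <- sample(rep(1:10, times = 10))   # 100 integers, each of 1..10 exactly 10 times, in random order
table(x)                             # check: every frequency is 10
sample(1:10, 100, replace = TRUE)    # alternative: independent draws, where frequencies will vary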

Know this "maximum of sums towards minimum" distribution?

Suppose we have N integer slots, all initially containing 0, and an infinite sequence of independent, identically distributed negative binomial variables w_i ~ NB(l, q). Each successive value from the sequence is added to the slot currently containing the minimal value. The question is: what is the distribution of the step number at which any of the slots first exceeds a given limit k?
In case you ask, it's a simplified model of an attack on a vulnerable ("one unit lost means all die") stack of units in the game Freeciv.
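A minimal simulation sketch of the process described above; the parameter values are placeholders chosen for illustration, and NB(l, q) is read as R's size = l, prob = q parametrisation:
N <- 5; l <- 3; q <- 0.4; k <- 50                 # placeholder parameters
simulate_steps <- function(N, l, q, k) {
  slots <- rep(0, N)
  step <- 0
  while (max(slots) <= k) {
    step <- step + 1
    i <- which.min(slots)                         # slot currently holding the minimal value
    slots[i] <- slots[i] + rnbinom(1, size = l, prob = q)
  }
  step                                            # first step at which some slot exceeds k
}
steps <- replicate(10000, simulate_steps(N, l, q, k))
hist(steps)                                       # empirical distribution of that step number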

r - Estimate selection-unbiased allele frequencies with linear regression systems

I have a few data sets consisting of frequencies for i distinct alleles/SNPs in some populations. Additionally, I recorded some factors that are suspected of having changed the frequencies of these alleles within the populations in the past through their selectional effect. It is assumed that the selection impact can be described in the form of a simple linear regression for every selection factor.
Now I'd like to estimate what the allele frequencies would be expected to be under identical selectional forces (thus, I set selection=1). These new allele frequencies a'_i are derived as
a'_i = a_i - function[a_i|selection=1]
with the current frequency a_i of the allele i of a population and function[a_i|selection=1] as the estimated allele frequency under the absence of selectional forces.
However, there are some constraints for the whole process:
The minimal value allowed for a'_i is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, all my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may (but need not) have an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!
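Not the constrained regression system asked about, but a minimal sketch of one crude way to enforce the two constraints after an ordinary unconstrained fit, by truncating at 0 and rescaling so the frequencies sum to 1 (a_prime_raw is a hypothetical vector of unconstrained estimates for one population):
a_prime_raw <- c(0.55, 0.30, -0.02)    # hypothetical unconstrained estimates for alleles p, q, r
a_prime <- pmax(a_prime_raw, 0)        # constraint 1: no frequency below 0
a_prime <- a_prime / sum(a_prime)      # constraint 2: frequencies sum to 1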

Contradiction between Pearson and Pairwise.prop.test

I have two vectors a and b of the same length. The vectors contain the number of times each game has been played. So, for example, game 1 has been played 265350 times in group a, while it has been played 52516 times in group b.
a <- c(265350, 89148, 243182, 208991, 113090, 124698, 146574, 33649, 276435, 9320, 58630, 20139, 26178, 7837, 6405, 399)
b <- c(52516, 42840, 60571, 58355, 46975, 47262, 58197, 42074, 50090, 27198, 45491, 43048, 44512, 27266, 43519, 28766)
I want to use Pearson's chi-square test to test independence between the two vectors. In R I type
chisq.test(a,b)
and I get a p-value of 0.2348, suggesting that the two vectors are independent (the null hypothesis is not rejected).
But when I run pairwise.prop.test(a,b), I get all the pairwise p-values, and almost all of them are very low, suggesting that there is pairwise dependence between the two vectors, which contradicts the first result. How can that be?
The pairwise.prop.test is not the correct test for your case.
As mentioned in the documentation:
Calculate pairwise comparisons between pairs of proportions with correction for multiple testing
And also:
x (first argument).
Vector of counts of successes or a matrix with 2 columns giving the counts of successes and failures, respectively.
And
n (second argument).
Vector of counts of trials; ignored if x is a matrix.
So, x is the number of successes out of n trials, i.e. each element of x is less than or equal to (<=) the corresponding element of n. This is why pairwise.prop.test is used for proportions. As an example, imagine tossing a coin 1000 times and getting heads 550 times: x would be 550 and n would be 1000. In your case you do not have anything similar; you just have counts of a game in two groups.
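To make the successes/trials structure concrete, here is a small example with made-up numbers (the three groups below are purely illustrative):
prop.test(550, 1000)          # the coin-toss example: 550 heads out of 1000 tosses
x <- c(550, 480, 510)         # hypothetical successes in three groups
n <- c(1000, 1000, 1000)      # trials in each group; note x[i] <= n[i]
pairwise.prop.test(x, n)      # pairwise comparisons of the three proportions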
The correct hypothesis test for testing independence is the chisq.test(a,b) that you have already used, and I would trust that.

Looking for an efficient way to compute the variances of a multinomial distribution in R

I have an R matrix whose dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the rest of the columns represent the probabilities of a multinomial distribution of these counts. In other words, in each row the first column is n and the remaining k columns are the probabilities of the k categories. Another point is that the matrix is sparse, meaning that in each row there are many columns with a value of 0.
Here's a toy matrix I created:
mat = rbind(c(5,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1),
            c(2,0.2,0.2,0.2,0.2,0.2,0,0,0,0,0),
            c(22,0.4,0.6,0,0,0,0,0,0,0,0),
            c(5,0.5,0.2,0,0.1,0.2,0,0,0,0,0),
            c(4,0.4,0.15,0.15,0.15,0.15,0,0,0,0,0),
            c(10,0.6,0.1,0.1,0.1,0.1,0,0,0,0,0))
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat,1,function(x) rmultinom(samples,x[1],x[2:ncol(mat)]))
where, say, samples = 100000.
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive, at least in terms of RAM. Is there a more efficient solution in R to this problem?
If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws.
vars = apply(mat, 1, function(x) apply(rmultinom(samples, x[1], x[-1]), 1, var))
Each column of vars then holds the k per-category variances for the corresponding row of mat.
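If only the variances are needed, the marginal variance of a multinomial count is n * p_i * (1 - p_i), so as an alternative sketch the simulation can be skipped entirely:
exact_var = t(apply(mat, 1, function(x) x[1] * x[-1] * (1 - x[-1])))   # one row of exact variances per row of mat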
