How do I generate numbers from a discrete uniform distribution in R? I need n = 100 on the interval [1, 10]. Put simply, I need the numbers 1 through 10, but each number must have a frequency of 10. NOTE: the numbers must be integers.
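What I'm after would look something like this (a minimal sketch: rep() gives 10 copies of each integer, and sample() shuffles them, which may not even be needed if order doesn't matter):

```r
# 10 copies of each integer 1..10, shuffled into a random order
x <- sample(rep(1:10, each = 10))

length(x)  # 100
table(x)   # each value 1..10 appears exactly 10 times
```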
Suppose we have N integer slots, each initially containing 0, and an infinite sequence of similar independent negative binomial variables w_i ~ NB(l, q). Each successive value from the sequence is added to the slot containing the minimal value. The question is: what is the distribution of the step number at which any of the slots exceeds a given limit k?
In case you're wondering, it's a simplified model of an attack on a vulnerable ("one unit lost means all die") stack of units in the game Freeciv.
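To make the process concrete, here is a small Monte Carlo sketch of it (the parameter values for N, l, q and k are arbitrary assumptions chosen just for illustration):

```r
# N slots start at 0; each step an NB(l, q) draw is added to the slot
# holding the current minimum; record the step at which any slot first
# exceeds the limit k.
simulate_steps <- function(N, l, q, k) {
  slots <- rep(0, N)
  step <- 0
  while (max(slots) <= k) {
    step <- step + 1
    i <- which.min(slots)
    slots[i] <- slots[i] + rnbinom(1, size = l, prob = q)
  }
  step
}

set.seed(1)
steps <- replicate(10000, simulate_steps(N = 3, l = 2, q = 0.5, k = 10))
summary(steps)  # empirical distribution of the stopping step
```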
I have a few data sets consisting of frequencies of i distinct alleles/SNPs in some populations. Additionally, I recorded some factors that are suspected of having changed the frequencies of these alleles within the populations in the past due to their selectional effect. It is assumed that the selection impact can be described in the form of a simple linear regression for every selection factor.
Now I'd like to estimate what the allele frequencies would be under identical selectional forces (thus, I set selection = 1). These new allele frequencies a'_i are derived as
a'_i = a_i - function[a_i|selection=1]
where a_i is the current frequency of allele i in a population and function[a_i|selection=1] is the estimated allele frequency in the absence of selectional forces.
However, there are some constraints for the whole process:
The minimal value allowed for a'_i is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
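To make the problem concrete, here is a minimal sketch of the unconstrained per-allele regression approach with made-up data (hypothetical frequencies a1..a3 and a single hypothetical selection variable x); nothing in it forces the predictions to stay non-negative or to sum to 1:

```r
# Hypothetical data: observed frequencies of 3 alleles plus one
# selection variable x (all values invented for illustration)
d <- data.frame(
  a1 = c(0.20, 0.25, 0.30, 0.15),
  a2 = c(0.50, 0.45, 0.40, 0.55),
  a3 = c(0.28, 0.31, 0.29, 0.32),
  x  = c(1.2, 0.8, 2.1, 0.3)
)

# One simple linear regression per allele, as described above
fits <- lapply(c("a1", "a2", "a3"),
               function(a) lm(reformulate("x", a), data = d))

# Predicted frequencies at selection = 1
pred <- sapply(fits, predict, newdata = data.frame(x = 1))
sum(pred)  # in general not exactly 1, and entries can fall below 0
```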
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may (but need not) have an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!
I have two vectors a and b of the same length. The vectors contain the number of times each game has been played. So, for example, game 1 has been played 265,350 times in group a, while it has been played 52,516 times in group b.
a <- c(265350, 89148, 243182, 208991, 113090, 124698, 146574, 33649, 276435, 9320, 58630, 20139, 26178, 7837, 6405, 399)
b <- c(52516, 42840, 60571, 58355, 46975, 47262, 58197, 42074, 50090, 27198, 45491, 43048, 44512, 27266, 43519, 28766)
I want to use Pearson's chi-squared test to test independence between the two vectors. In R I type
chisq.test(a,b)
and I get a p-value of 0.2348, meaning that the two vectors are independent (H0 is not rejected).
But when I run pairwise.prop.test(a, b) I get all the pairwise p-values, and almost all of them are very low, suggesting pairwise dependence between the two vectors, which contradicts the first result. How can that be?
The pairwise.prop.test is not the correct test for your case.
As the documentation says:
Calculate pairwise comparisons between pairs of proportions with correction for multiple testing
And also:
x (first argument).
Vector of counts of successes or a matrix with 2 columns giving the counts of successes and failures, respectively.
And
n (second argument).
Vector of counts of trials; ignored if x is a matrix.
So, x is the number of successes out of n trials, i.e. each entry of x must be less than or equal to the corresponding entry of n. This is why pairwise.prop.test is used for proportions. As an example, imagine tossing a coin 1000 times and getting heads 550 times: x would be 550 and n would be 1000. In your case you don't have anything similar; you just have counts of a game in two groups.
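A minimal sketch of the kind of data pairwise.prop.test is designed for (made-up heads counts out of tosses for three hypothetical coins):

```r
# Hypothetical success/trial counts: heads out of tosses for 3 coins
heads  <- c(550, 480, 530)
tosses <- c(1000, 1000, 1000)

# Compares the proportions of every pair of groups, with p-values
# adjusted for multiple testing (Holm correction by default)
res <- pairwise.prop.test(x = heads, n = tosses)
res
```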
The correct hypothesis test for testing independence is the chisq.test(a, b) you have already used, and I would trust that result.
I have an R matrix whose dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the rest of the columns represent the probabilities of a multinomial distribution of these counts. In other words, in each row the first column is n and the remaining k columns are the probabilities of the k categories. Another point is that the matrix is sparse, meaning that in each row many columns have a value of 0.
Here's a toy matrix I created:
mat = rbind(
  c(5,  0.1, 0.1,  0.1,  0.1,  0.1,  0.1, 0.1, 0.1, 0.1, 0.1),
  c(2,  0.2, 0.2,  0.2,  0.2,  0.2,  0,   0,   0,   0,   0),
  c(22, 0.4, 0.6,  0,    0,    0,    0,   0,   0,   0,   0),
  c(5,  0.5, 0.2,  0,    0.1,  0.2,  0,   0,   0,   0,   0),
  c(4,  0.4, 0.15, 0.15, 0.15, 0.15, 0,   0,   0,   0,   0),
  c(10, 0.6, 0.1,  0.1,  0.1,  0.1,  0,   0,   0,   0,   0)
)
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat,1,function(x) rmultinom(samples,x[1],x[2:ncol(mat)]))
Where say samples=100000
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive, at least in terms of RAM. Is there a more efficient solution in R to this problem?
If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws.
draws = apply(mat, 1, function(x) apply(rmultinom(samples, x[1], x[-1]), 1, var))
Note that rmultinom returns a k x samples matrix, so the per-category variance has to be taken over its rows; calling var directly on that matrix would instead return a covariance matrix of its columns.
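As a side note, if simulation isn't strictly required, each category count is marginally binomial with parameters n and p_j, so its variance is n * p_j * (1 - p_j) and can be computed exactly in one vectorized step, with no draws at all (a sketch using two rows of the toy matrix above):

```r
# Column 1 holds n; the remaining columns hold category probabilities
mat <- rbind(
  c(5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
  c(2, 0.2, 0.2, 0.2, 0.2, 0.2, 0,   0,   0,   0,   0)
)
n <- mat[, 1]
p <- mat[, -1]

# n has one entry per row of p, so n * p scales each row by its own n
v <- n * p * (1 - p)

v[1, 1]  # 5 * 0.1 * 0.9 = 0.45
v[2, 1]  # 2 * 0.2 * 0.8 = 0.32
```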