Looking for an efficient way to compute the variances of a multinomial distribution in R - r

I have an R matrix which dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the rest of the columns represent the probabilities of a multinomial distribution of these counts. So in other words, in each row the first column is n and the rest of the k columns are the probabilities of the k categories. Another point is that the matrix is sparse, meaning that in each row there are many columns with value of 0.
Here's a toy matrix I created:
mat=rbind(c(5,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1),c(2,0.2,0.2,0.2,0.2,0.2,0,0,0,0,0),c(22,0.4,0.6,0,0,0,0,0,0,0,0),c(5,0.5,0.2,0,0.1,0.2,0,0,0,0,0),c(4,0.4,0.15,0.15,0.15,0.15,0,0,0,0,0),c(10,0.6,0.1,0.1,0.1,0.1,0,0,0,0,0))
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat,1,function(x) rmultinom(samples,x[1],x[2:ncol(mat)]))
Where say samples=100000
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive at least in terms of RAM. Is whether a more efficient solution in R to this problem?

If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws.
draws = apply(mat,1,function(x) var(rmultinom(samples,x[1],x[2:ncol(mat)])))

Related

PCoA function pcoa extract vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they are different based on the percentage of variance. i.e I want to find out if a) there are any variables which helps to draw certain observation points apart from one another and b) if yes, what is the percentage of variance explained by it?
I was advised to run a PCoA (Principle Coordinates Analysis) on my data. I ran it using vegan and ape package. This is my code after loading my csv file into r, I call it data
#data.dis<-vegdist(data,method="gower",na.rm=TRUE)
#data.pcoa<-pcoa(data.dis)
I was then told to extract the vectors from the pcoa data and so
#data.pcoa$vectors
It then returned me 132 rows but 20 columns of values (e.g. from Axis 1 to Axis 20)
I was perplexed over why there were 20 columns of values when I only have 10 variables. I was under the impression that I would only get 10 columns. If any kind souls out there could help to explain a) what do the vectors actually represent and b) how do I get the percentage of variance explained by Axis 1 and 2?
Another question that I had was I don't really understand the purpose of extracting the eigenvalues from data.pcoa because I saw some websites doing that after running a pcoa on their distance matrix but there was no further explanation on it.
Gower index is non-Euclidean and you can expect more real axes than the number of variables in Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist() which only accepts numeric data. Moreover, if the variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for Gower dissimilarity (there are many alternatives).
Te percentage of "variance" is a bit tricky for non-Euclidean dissimilarities which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of data. ape::pcoa() returns the information you asked in the element values. The proportion of variances explained is in its element values$Relative_eig. The total "variance" is returned in element trace. All this was documented in ?pcoa where I read it.

r - Estimate selection-unbiased allele frequencies with linear regression systems

I have a few data sets consisting of frequencies for i distinct alleles/SNPs of some populations. Additionally I recorded some factors that are suspicious for having changed the frequencies of these alleles within the populations in the past due to their selectional effect. It is assumed that the selection impact can be described in the form of a simple linear regression for every selection factor.
Now I'd like to estimate how the allele frequencies are expected to be under identical selectional forces (thus, I set selection=1). These new allele frequencies a'_i are derived as
a'_i = a_i - function[a_i|selection=1]
with the current frequency a_i of the allele i of a population and function[a_i|selection=1] as the estimated allele frequency under the absence of selectional forces.
However, there are some constraints for the whole process:
The minimal values of a'_i allowed is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, all my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may have (but don't have to) an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!

Weighted observation frequency clustering using hclust in R

I have a large matrix of 500K observations to cluster using hierarchical clustering. Due to the large size, i do not have the computing power to calculate the distance matrix.
To overcome this problem I chose to aggregate my matrix to merge those observations which were identical to reduce my matrix to about 10K observations. I have the frequency for each of the rows in this aggregated matrix. I now need to incorporate this frequency as a weight in my hierarchical clustering.
The data is a mixture of numerical and categorical variables for the 500K observations so i have used the daisy package to calculate the gower dissimilarity for my aggregated dataset. I want to use hclust in the stats package for the aggregated dataset however i want to take into account the frequency of each observation. From the help information for hclust the arguments are as follows:
hclust(d, method = "complete", members = NULL)
The information for the members argument is:, NULL or a vector with length size of d. See the ‘Details’ section. When you look at the details section you get: If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means.
From the above description, i am unsure if i can assign my frequency weights to the members arguments as it is not clear if this is the purpose of this argument. I would like to use it like this:
hclust(d, method = "complete", members = df$freq)
Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.
If anyone can help me that would be great,
Thanks
Yes, this should work fine for most linkages, in particular single, group average and complete linkage. For ward etc. you need to correctly take the weights into account yourself.
But even that part is not hard. Just make sure to use the cluster sizes, because you need to pass the distance of two clusters, not two points. So the matrix should contain the distance of n1 points at location x and n2 points at location y. For min/max/mean this n disappears or cancels out. For ward, you should get a SSQ like formula.

Design Covariance Matrix in a simulation study in R in an efficient way

In my simulation study I need to come up with a covariance matrix for multivariate data.
My data:
dataset=data.frame(observation=rep(1:8,2),plot=rep(1:4,each=2),time=rep(1:2,8),treatment=rep(c("A","B","A","B"),each=4),OutputVariable=rep(c("P","Q"),each=8))
This dataset is multivariate, for every observation (1:8) there is more than one result. In this case, we observe a value for OutputVariable P and for OutputVariable Q at the same time. Note that actual outputs are not in this dataset as I will generate them at a later stage.
The desired Covariance Matrix would be 16x16. Where CovarMat[2,9] indicates the Covariance between the second line (Observation 2 of variable P) and the 9th line (Observation 1 of variable Q) in the dataset.
The value of, for instance, CovarMat[2,9] is based on rules like these:
CovarMat[2,9]=0
If dataset$plot[2]==dataset$plot[9] then CovarMat[2,9]=CovarMat[2,9]+1.5
If dataset$time[2]==dataset$time[9] then CovarMat[2,9]=CovarMat[2,9]+1.5
If (dataset$plot[2]==dataset$plot[9])&(dataset$time[2]==dataset$time[9]) then CovarMat[2,9]=CovarMat[2,9]+3
If abs(dataset$time[2]-dataset$time[9])=1 then CovarMat[2,9]=CovarMat[2,9]+2
Using For-loops thats easy enough (and thats what I did up to now). But my current dataset is 13,200 lines. And thus my CovarMat consists of 174,240,000 cells. Therefore, I am in desperate need of a more efficient way.

fisher's exact test (R) - simulated p-value does not vary

I have a problem using fisher’s exact test in R with a simulated p-value, but I don’t know if it’s a caused by “the technique” ( R ) or if it is (statistically) intended to work that way.
One of the datasets I want to work with:
matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),ncol=2,nrow=27)
The resulting p-value is always the same, no matter how often I repeat the test:
p = 1 / (B+1)
(B = number of replicates used in the Monte Carlo test)
When I shorten the matrix it works if the number of rows is lower than 19. Nevertheless it is not a matter of number of cells in the matrix. After transforming it into a matrix with 3 columns it still does not work, although it does when using the same numbers in just two columns.
Varying simulated p-values:
>a <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=2,nrow=18)
>b <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704),ncol=2,nrow=19)
>c <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=3,nrow=12)
>fisher.test(a,simulate.p.value=TRUE)$p.value
Number of cells in a and b are the same, but the simulation only works with matrix a.
Does anyone know if it is a statistical issue or a R issue and, if so, how it could be solved?
Thanks for your suggestions
I think that you are just seeing a very significant result. The p-value is being computed as the number of simulated (and the original) matrices that are as extreme or more extreme than the original. If none of the randomly generated matrices are as or more extreme then the p-value will just be 1 (the original matrix is as extreme as itself) divided by the total number of matrices which is $B+1$ (the B simulated and the 1 original matrix). If you run the function with enough samples (high enough B) then you will start to see some of the random matrices as or more extreme and therefor varying p-values, but the time to do so is probably not reasonable.

Resources