Calculate covariance matrix from a large dataset - r

I have a dataframe like df, with a dimension of 10,000 x 40,000 (this matrix has a lot of 0's):
value1 <- c(1, 0, 3, 0, 0, 2)
value2 <- c(0.8, 0.1, 9, 0, 0, 5)
value3 <- c(8, 3, 0, 0, 0, 0)
df <- data.frame(value1, value2, value3)
I want to calculate the covariance matrix of df.
I have tried bigcor(), and I have also tried computing the covariance matrix of a sparse matrix (see "Running cor() (or any variant) over a sparse matrix in R").
However, the R session aborts.
Any help?
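One thing worth checking before anything else is the size of the result. If the 40,000 dimension indexes the variables, the covariance matrix has 40,000 x 40,000 entries, which is about 12.8 GB as dense doubles, and it is generally dense even when the data are sparse. The following is only a minimal base R sketch of the cross-product identity that sparse approaches rely on, run on the toy data from the question. The helper name cov_crossprod is hypothetical, and it is an assumption that the same expression carries over to a sparse matrix from the Matrix package once that package is loaded.

```r
# Covariance via cross-products: S = (X'X - n * mu mu') / (n - 1).
# cov_crossprod is a hypothetical helper name, not from any package.
cov_crossprod <- function(X) {
  n  <- nrow(X)
  mu <- colMeans(X)
  # X'X only touches the non-zero entries when X is stored sparsely;
  # the mean correction is applied afterwards.
  (crossprod(X) - n * outer(mu, mu)) / (n - 1)
}

# Toy data from the question
value1 <- c(1, 0, 3, 0, 0, 2)
value2 <- c(0.8, 0.1, 9, 0, 0, 5)
value3 <- c(8, 3, 0, 0, 0, 0)
X <- cbind(value1, value2, value3)

S <- cov_crossprod(X)
max(abs(S - cov(X)))  # difference from cov() should be floating-point noise
```

If the full matrix does not fit in memory, the same identity can be applied to blocks of columns at a time, which is roughly what block-wise tools do.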

Related

Extracting the functional form of the likelihood function that gets formed by the msm function in R

Is there a way of extracting the functional form of the likelihood function that gets formed by the msm function in R?
How can I extract the likelihood function that gets formed in the example below? I want to try and implement my own version of the quasi-Newton maximisation algorithm to improve my understanding.
library(msm)
# look at transition counts
statetable.msm(state, PTNUM, data = cav)
# define transition intensity matrix
# 1's mean a transition can occur
# 0's mean a transition should not occur
# any number can be placed on the diagonal as R overwrites the diagonals
# prior to maximising
q <- rbind(
c(0, 1, 0, 1),
c(1, 0, 1, 1),
c(0, 1, 0, 1),
c(0, 0, 0, 0)
)
# fit msm to the data
# the fnscale rescales the likelihood to prevent overflow
msm.fit <- msm(state ~ years, PTNUM, data = cav, qmatrix = q, control=list(fnscale=4000))

Bray-Curtis Pairwise Analysis in R

I am trying to calculate and visualize the Bray-Curtis dissimilarity between communities at paired/pooled sites using the vegan package in R.
Below is a simplified example dataframe:
Site = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
PoolNumber = c(1, 3, 4, 2, 4, 1, 2, 3, 4, 4)
Sp1 = c(3, 10, 7, 0, 12, 9, 4, 0, 4, 3)
Sp2 = c(2, 1, 17, 1, 2, 9, 3, 1, 6, 7)
Sp3 = c(5, 12, 6, 10, 2, 4, 0, 1, 3, 3)
Sp4 = c(9, 6, 4, 8, 13, 5, 2, 20, 13, 3)
df = data.frame(Site, PoolNumber, Sp1, Sp2, Sp3, Sp4)
"Site" is a variable indicating the location where each sample was taken.
The "Sp" columns indicate abundance values of species at each site.
I want to compare pairs of sites that have the same "PoolNumber" and get a dissimilarity value for each comparison.
Most examples suggest I should create a matrix with only the "Sp" columns and use this code:
matrix <- df[,3:6]
braycurtis = vegdist(matrix, "bray")
hist(braycurtis)
However, I'm not sure how to tell R which rows to compare if I eliminate the columns with "PoolNumber" and "Site". Would this involve organizing by "PoolNumber", using this as a row name and then writing a loop to compare every 2 rows?
I am also finding the output difficult to interpret. Lower Bray-Curtis values indicate more similar communities (closer to a value of 0), while higher values (closer to 1) indicate more dissimilar communities, but is there a way to tell directionality, which one of the pair is more diverse?
I am a beginner R user, so I apologize for any misuse of terminology/formatting. All suggestions are appreciated.
Thank you
Do you mean that you want to get a subset of dissimilarities with equal PoolNumber? The vegdist function will get you all dissimilarities, and you can pick your pairs from those. This is easiest when you first transform dissimilarities into a symmetric matrix and then pick your subset from that symmetric matrix:
braycurtis <- vegdist(df[,3:6])
as.matrix(braycurtis)[df$PoolNumber==4,df$PoolNumber==4]
as.dist(as.matrix(braycurtis)[df$PoolNumber==4,df$PoolNumber==4])
If you only want averages, the vegan::meandist function will give you those:
meandist(braycurtis, df$PoolNumber)
Here diagonal values will be mean dissimilarities within PoolNumber and off-diagonal mean dissimilarities between different PoolNumbers. Looking at the code of vegan::meandist you can see how this is done.
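To make the "pick your pairs" step concrete, here is a minimal base R sketch that lists every within-pool pair with its Bray-Curtis value. It does not use vegan: the helper bray is a hypothetical name that applies the usual formula sum(|a - b|) / sum(a + b) to raw abundances, which is assumed to match what vegdist computes by default.

```r
# Example data from the question
df <- data.frame(
  Site = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  PoolNumber = c(1, 3, 4, 2, 4, 1, 2, 3, 4, 4),
  Sp1 = c(3, 10, 7, 0, 12, 9, 4, 0, 4, 3),
  Sp2 = c(2, 1, 17, 1, 2, 9, 3, 1, 6, 7),
  Sp3 = c(5, 12, 6, 10, 2, 4, 0, 1, 3, 3),
  Sp4 = c(9, 6, 4, 8, 13, 5, 2, 20, 13, 3)
)

# Bray-Curtis dissimilarity between two abundance vectors (hypothetical helper)
bray <- function(a, b) sum(abs(a - b)) / sum(a + b)

# All unordered pairs of rows, kept only when both rows share a PoolNumber
idx <- t(combn(nrow(df), 2))
idx <- idx[df$PoolNumber[idx[, 1]] == df$PoolNumber[idx[, 2]], , drop = FALSE]

abund <- as.matrix(df[, 3:6])
within_pool <- data.frame(
  Pool  = df$PoolNumber[idx[, 1]],
  Site1 = df$Site[idx[, 1]],
  Site2 = df$Site[idx[, 2]],
  BrayCurtis = mapply(function(i, j) bray(abund[i, ], abund[j, ]),
                      idx[, 1], idx[, 2])
)
within_pool
```

Pools 1, 2 and 3 contribute one pair each and pool 4 contributes six, so the table has nine rows; hist(within_pool$BrayCurtis) then shows only the within-pool comparisons.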
Bray-Curtis dissimilarity (like all ordinary dissimilarities) is a symmetric measure and has no notion of one site being more diverse than the other. You can assess diversity for each site separately, but you first need to say what you mean by "diverse" (a diversity index or something else?). Then you just use those values in your calculations.
If you just want to look at the number of items (species), the following function will give you the differences in the lower triangle (the upper triangle values are the same with the sign switched):
designdist(df[,3:6], "A-B", "binary")
Alternatively you can work with row-wise statistics and see their differences. This is an example with Shannon-Weaver diversity index:
H <- diversity(df[,3:6])
outer(H, H, "-")
To get the subsets, work similarly as with the Bray-Curtis index.
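For readers who want to see what the row-wise approach computes, this is a base R sketch of the Shannon index and its pairwise differences. The helper shannon is a hypothetical name; it uses natural logarithms, which is assumed to match the default of vegan's diversity().

```r
# Species abundances from the question, one row per site
abund <- cbind(
  Sp1 = c(3, 10, 7, 0, 12, 9, 4, 0, 4, 3),
  Sp2 = c(2, 1, 17, 1, 2, 9, 3, 1, 6, 7),
  Sp3 = c(5, 12, 6, 10, 2, 4, 0, 1, 3, 3),
  Sp4 = c(9, 6, 4, 8, 13, 5, 2, 20, 13, 3)
)
rownames(abund) <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")

# Shannon index H = -sum(p * log(p)) over species that are present
# (hypothetical helper, natural logarithm)
shannon <- function(v) {
  p <- v[v > 0] / sum(v)
  -sum(p * log(p))
}

H <- apply(abund, 1, shannon)

# Entry [i, j] is H[i] - H[j]: positive when site i is more diverse than site j
diffH <- outer(H, H, "-")
round(diffH["C", c("E", "I", "J")], 3)  # site C against the other pool-4 sites
```

The sign of each entry gives the direction that the symmetric Bray-Curtis value cannot.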

Fitting Binomial Distribution in R using data with varying sample sizes

I have some data that looks like this:
x y
1: 3 1
2: 6 1
3: 1 0
4: 31 8
5: 1 0
---
Edit: if it helps, here are sample vectors for x and y:
x = c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y = c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
The column on the left (x) is my sample size, and the column on the right (y) is the number of successes that occur in each sample.
I would like to fit these data using a binomial distribution in order to find the probability of a success (p). All examples for fitting a binomial distribution that I've found so far assume a constant sample size (n) across all data points, but here I have varying sample sizes.
How do I fit data like these, with varying sample sizes, to a binomial distribution? The desired outcome is p, the probability of observing a success in a sample size of 1.
How do I accomplish a fit like this using R?
(Edit #2: Response below outlines solution and related R code if I assume that the events observed in each sample can be assumed to be independent, in addition to assuming that the samples themselves are also independent. This works for my data - thanks!)
What about calculating the empirical probability of success?
x <- c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y <- c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
avr.sample <- mean(x)
avr.success <- mean(y)
p <- avr.success/avr.sample
p
[1] 0.1151515
Or using binom.test
z <- x-y # number of fails
binom.test(x = c(sum(y), sum(z)))
Exact binomial test
data: c(sum(y), sum(z))
number of successes = 19, number of trials = 165, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.07077061 0.17397215
sample estimates:
probability of success
0.1151515
However, this assumes that:
The events corresponding to the rows are independent from each other
The events in the same row are independent from each other as well
This means that in every iteration k of the experiment (i.e. row of x) we perform an action such as throwing x[k] identical dice (not necessarily fair), and a success means getting a given (predetermined) number n in 1:6.
If we suppose that the above results were obtained by trying to get a 1 when throwing x[k] dice in iteration k, then the empirical probability of getting a 1 is approximately 0.1151515.
In the end, the distribution in question would be B(sum(x), p).
PS: In the above illustration, the dice are identical to each other not only within any given iteration but across all iterations.
library(bbmle)
x = c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y = c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
mf = function(prob, x, size){
  -sum(dbinom(x, size, prob, log=TRUE))
}
m1 = mle2(mf, start=list(prob=0.01), data=list(x=y, size=x))
print(m1)
Coefficients:
prob
0.1151535
Log-likelihood: -13.47
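Both answers land on essentially the same number because, under the independence assumptions stated above, the binomial likelihood with varying sizes is maximised at sum(y)/sum(x). As a cross-check that needs no extra package, here is a sketch using an intercept-only binomial glm. This is a swapped-in alternative, not the method used in either answer, and the names fit, p_hat and p_closed are only illustrative.

```r
x <- c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)  # trials per sample
y <- c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)      # successes per sample

# Intercept-only logistic regression on (successes, failures) pairs;
# each row contributes a binomial term with its own size
fit <- glm(cbind(y, x - y) ~ 1, family = binomial)

# The intercept is on the logit scale, so map it back to a probability
p_hat <- unname(plogis(coef(fit)))

# Closed-form maximiser of the pooled binomial likelihood
p_closed <- sum(y) / sum(x)

c(glm = p_hat, closed_form = p_closed)
```

The glm route becomes useful once p is allowed to depend on covariates; for a single common p, the closed form is all that is needed.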

Sage polynomial coefficients including zeros

If we have a multivariate polynomial in Sage, for instance
f=3*x^2*y^2+x*y+3
how can I display the full list of coefficients, including the zero ones from missing terms between the maximum-degree term and the constant?
P.<x,y> = PolynomialRing(ZZ, 2, order='lex')
f=3*x^2*y^2+x*y+3
f.coefficients()
gives me the list
[3, 1, 3]
but I'd like the "full" list to put into a matrix. In the above example it should be
[3, 0, 0, 1, 0, 0, 0, 0, 3]
corresponding to terms:
x^2*y^2, x^2*y, x*y^2, x*y, x^2, y^2, x, y, constant
Am I missing something?
Your desired output isn't quite well defined, because the monomials you listed are not in the lexicographic order (which you used in the first line of your code). Anyway, using a double loop you can arrange coefficients in any specific way you want. Here is a natural way to do this:
coeffs = []
for i in range(f.degree(x), -1, -1):
    for j in range(f.degree(y), -1, -1):
        coeffs.append(f.coefficient({x:i, y:j}))
Now coeffs is [3, 0, 0, 0, 1, 0, 0, 0, 3], corresponding to
x^2*y^2, x^2*y, x^2, x*y^2, x*y, x, y^2, y, constant
The built-in .coefficients() method is only useful if you also use .monomials() which provides a matching list of monomials that have those coefficients.

how to simulate correlated binary data with R? [duplicate]

This question already has answers here:
Generate correlated random numbers from binomial distributions
(3 answers)
Closed 9 years ago.
Suppose I want two vectors of binary data with a specified phi coefficient. How could I simulate them with R?
For example, how can I create two vectors like x and y, of a specified length, with a correlation coefficient of 0.79?
> x = c(1, 1, 0, 0, 1, 0, 1, 1, 1)
> y = c(1, 1, 0, 0, 0, 0, 1, 1, 1)
> cor(x,y)
[1] 0.7905694
The bindata package is nice for generating binary data with this and more complicated correlation structures. (Here's a link to a working paper (warning, pdf) that lays out the theory underlying the approach taken by the package authors.)
In your case, assuming that the marginal probabilities of x and y are both 0.5:
library(bindata)
## Construct a binary correlation matrix
rho <- 0.7905694
m <- matrix(c(1,rho,rho,1), ncol=2)
## Simulate 100000 x-y pairs, and check that they have the specified
## correlation structure
x <- rmvbin(1e5, margprob = c(0.5, 0.5), bincorr = m)
cor(x)
# [,1] [,2]
# [1,] 1.0000000 0.7889613
# [2,] 0.7889613 1.0000000
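For just two variables the same thing can be sketched in base R, since the phi coefficient and the two marginal probabilities fix the whole 2 x 2 joint distribution: P(x=1, y=1) = p1*p2 + rho*sqrt(p1*(1-p1)*p2*(1-p2)). This is not how bindata works internally, only a direct alternative for the bivariate case, and the function name r_corr_binary is hypothetical.

```r
# Draw n correlated binary pairs by sampling cells of the 2 x 2 joint table.
# r_corr_binary is a hypothetical helper, not part of any package.
r_corr_binary <- function(n, p1, p2, rho) {
  p11 <- p1 * p2 + rho * sqrt(p1 * (1 - p1) * p2 * (1 - p2))
  # Cell probabilities in the order (1,1), (1,0), (0,1), (0,0)
  probs <- c(p11, p1 - p11, p2 - p11, 1 - p1 - p2 + p11)
  # Not every (p1, p2, rho) combination is attainable
  stopifnot(all(probs >= 0))
  cell <- sample.int(4, n, replace = TRUE, prob = probs)
  cbind(x = as.integer(cell <= 2),          # x = 1 in cells 1 and 2
        y = as.integer(cell %in% c(1, 3)))  # y = 1 in cells 1 and 3
}

set.seed(1)
xy <- r_corr_binary(1e5, p1 = 0.5, p2 = 0.5, rho = 0.7905694)
cor(xy[, "x"], xy[, "y"])  # should be close to the requested 0.79
colMeans(xy)               # both margins should be close to 0.5
```

The stopifnot guard matters in practice: with unequal margins, large correlations give a negative cell probability and cannot be simulated.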
