dat <- as.data.frame(replicate(100,sample(c(0,1),100,replace=TRUE)))
I want to create a 100 by 100 matrix with the correlation coefficients between these binary variables as entries.
If the variables were continuous, I would have used cor() to create the matrix. I am not sure whether cor() with Pearson as the method is reasonable here. If not, suppose I can find some function fn() that calculates the correlation between a pair of binary vectors; what is an efficient way to construct the 100 by 100 matrix from it?
Not sure this is a Stack Overflow answer. What you are asking for is the correlation between binary vectors. This is called the phi coefficient, which goes back to Pearson.
For binary variables it approximates the Pearson correlation. You might try
sqrt(chisq.test(table(dat[,1],dat[,2]), correct=FALSE)$statistic/length(dat[,1]))
and notice that it gives the same value 0.08006408 as
cor(dat[1], dat[2])
This is because the approximation is quite good for reasonably large sample sizes, say greater than 40.
So, I would advocate saving yourself some time and just using cor(dat) as the solution.
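For completeness, a minimal sketch of building the full matrix with the simulated data from the question; the seed and the spot check at the end are my own additions, not part of the original answer:
set.seed(1)   # not in the original post; added only so the numbers are reproducible
dat <- as.data.frame(replicate(100, sample(c(0, 1), 100, replace = TRUE)))
phi_mat <- cor(dat)          # the full 100 by 100 correlation (phi) matrix
dim(phi_mat)                 # 100 100
# spot check one pair against the chi-square-based phi; sqrt() loses the sign, hence abs()
phi_12 <- sqrt(chisq.test(table(dat[, 1], dat[, 2]), correct = FALSE)$statistic / nrow(dat))
c(phi_12, abs(cor(dat[, 1], dat[, 2])))   # the two values agree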
I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they differ based on the percentage of variance, i.e. I want to find out a) whether there are any variables that help to draw certain observation points apart from one another and b) if so, what percentage of variance is explained by them.
I was advised to run a PCoA (Principal Coordinates Analysis) on my data. I ran it using the vegan and ape packages. This is my code after loading my csv file into R (I call it data):
data.dis <- vegdist(data, method = "gower", na.rm = TRUE)
data.pcoa <- pcoa(data.dis)
I was then told to extract the vectors from the pcoa data and so
data.pcoa$vectors
It returned 132 rows but 20 columns of values (Axis 1 to Axis 20).
I was perplexed over why there were 20 columns of values when I only have 10 variables; I was under the impression that I would only get 10 columns. If any kind souls out there could help explain a) what the vectors actually represent and b) how I can get the percentage of variance explained by Axis 1 and 2?
Another question I had: I don't really understand the purpose of extracting the eigenvalues from data.pcoa. I saw some websites doing that after running a PCoA on their distance matrix, but there was no further explanation of it.
The Gower index is non-Euclidean, and you can expect more real axes than the number of variables in a Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist(), which only accepts numeric data: if a variable is defined as a factor, vegdist() refuses to compute the dissimilarities and gives an error. So if you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for Gower dissimilarity (there are many alternatives; see the sketch below).
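The answer does not name a specific package; as one hedged illustration, cluster::daisy() computes Gower dissimilarities for data frames containing factors (the object names follow the question's code):
library(cluster)
library(ape)
data[] <- lapply(data, factor)              # make sure the categorical columns really are factors
data.dis <- daisy(data, metric = "gower")   # Gower dissimilarity that accepts factor variables
data.pcoa <- pcoa(data.dis)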
The percentage of "variance" is a bit tricky for non-Euclidean dissimilarities, which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of the data. ape::pcoa() returns the information you asked for in the element values. The proportion of variance explained is in its element values$Relative_eig. The total "variance" is returned in the element trace. All this is documented in ?pcoa, where I read it.
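For concreteness, a short sketch of pulling those elements out (object names again follow the question's code):
data.pcoa$values$Relative_eig               # proportion of "variance" for each axis
100 * data.pcoa$values$Relative_eig[1:2]    # percentage explained by Axis 1 and Axis 2
data.pcoa$trace                             # total "variance"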
There is an ongoing discussion about the reliable methods of rounding imputed binary variables. Still, the so-called Adaptive Rounding Procedure developed by Bernaards and colleagues (2007) is currently the most widely accepted solution.
The Adaptive Rounding Procedure involves a normal approximation to a binomial distribution. That is, the imputed values in a binary variable are assigned the value 0 or 1, based on a threshold derived from the formula below, where mean(x) is the mean of the imputed binary variable:
threshold <- mean(x) - qnorm(mean(x))*sqrt(mean(x)*(1-mean(x)))
To the best of my knowledge, major R packages for imputation (such as Amelia or mice) have yet to include functions that help with the rounding of binary variables. This shortcoming is especially difficult for researchers who intend to use the imputed values in a logistic regression analysis, where the dependent variable is coded as binary.
Therefore, it makes sense to write an R function for the Bernaards formula above:
bernaards <- function(x) {
  mean(x) - qnorm(mean(x)) * sqrt(mean(x) * (1 - mean(x)))
}
With this formula, it is much easier to calculate the threshold for an imputed binary variable with a mean of, say, .623:
bernaards(.623)
[1] 0.4711302
After calculating the threshold, the usual next step is to round the imputed values in variable x.
My question is: how can the above function be extended to include that task as well?
In other words, one can do all of the above in R with three lines of code:
threshold <- mean(df$x) - qnorm(mean(df$x)) * sqrt(mean(df$x) * (1 - mean(df$x)))
df$x[df$x >= threshold] <- 1
df$x[df$x < threshold] <- 0
It would be best if the function included the above recoding/rounding, as repeating the same process for each binary variable would be time-consuming, especially when working with large data sets. With such a function, one could simply run an extra line of code (as below) after imputation, and continue with the analyses:
bernaards(dummy1, dummy2, dummy3)
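Not part of the original question, but a minimal sketch of one way the rounding step could be folded into a function; bernaards_round() and the lapply() call are my own illustration, assuming each argument is a numeric vector of imputed values for one binary variable:
bernaards_round <- function(x) {
  p <- mean(x)
  threshold <- p - qnorm(p) * sqrt(p * (1 - p))
  as.numeric(x >= threshold)   # values at or above the threshold become 1, the rest 0
}
# hypothetical usage after imputation, for several dummies in a data frame df
df[c("dummy1", "dummy2", "dummy3")] <- lapply(df[c("dummy1", "dummy2", "dummy3")], bernaards_round)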
I've tried to contact William Revelle about this but he isn't responding.
In the psych package there is a function called cor.smoother, which determines whether or not a correlation matrix is positive definite. Its explanation is as follows:
"cor.smoother examines all of nvar minors of rank nvar-1 by systematically dropping one variable at a time and finding the eigen value decomposition. It reports those variables, which, when dropped, produce a positive definite matrix. It also reports the number of negative eigenvalues when each variable is dropped. Finally, it compares the original correlation matrix to the smoothed correlation matrix and reports those items with absolute deviations great than cut. These are all hints as to what might be wrong with a correlation matrix."
It is really the statement in bold that I am hoping someone can interpret in a more understandable way for me?
A belated answer to your question.
Correlation matrices are said to be improper (or more accurately, not positive semi-definite) when at least one of the eigen values of the matrix is less than 0. This can happen if you have some missing data and are using pair-wise complete correlations. It is particularly likely to happen if you are doing tetrachoric or polychoric correlations based upon data sets with some or even a lot of missing data.
(A correlation matrix, R, may be decomposed into a set of eigen vectors (X) and eigen values (lambda) where R = X lambda X’. This decomposition is the basis of components analysis and factor analysis, but that is more than you want to know.)
The cor.smooth function finds the eigen values and then adjusts the negative ones by making them slightly positive (and adjusting the other ones to compensate for this change).
The cor.smoother function attempts to identify the variables that are making the matrix improper. It does this by considering all the matrices generated by dropping one variable at a time and seeing which of those are not positive semi-definite (i.e. have eigen values < 0). Ideally, this will identify one variable that is messing things up.
An example of this is in the burt data set where the sorrow-tenderness correlation was probably mistyped and the .87 should be .81.
cor.smoother(burt) #identifies tenderness and sorrow as likely culprits
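As a hedged illustration (mine, not the answerer's), using the burt correlation matrix that ships with the psych package:
library(psych)
data(burt)                        # Burt (1915) correlation matrix bundled with psych
eigen(as.matrix(burt))$values     # one eigenvalue is negative, so the matrix is improper
burt.smooth <- cor.smooth(burt)   # nudges the negative eigenvalue to be slightly positive
cor.smoother(burt)                # flags the variables most likely causing the problem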
I have used the factanal function in R to do a factor analysis on a data set.
Viewing the summary of the output, I see I have access to the loadings and other objects, but I am interested in the scores from the factor analysis.
How can I get the scores when using the factanal function?
I attempted to calculate the scores myself:
m <- t(as.matrix(factor$loadings))
n <- (as.matrix(dataset))
scores <- m%*%n
and got the error:
Error in m %*% n : non-conformable arrays
which I don't understand, since I double-checked the dimensions of the data and they seem to be in agreement.
Thanks everyone for your help.
Ah.
factormodel$loadings[,1] %*% t(dataset)
This question might be a bit dated, but nevertheless:
factanal can return a matrix of scores; you access it just like the loadings, with factor$scores, so there is no need to calculate it yourself. You do, however, need to tell the function to produce the scores by using the scores argument.
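A minimal sketch of requesting the scores directly (my own illustration; the number of factors and the data set name are placeholders, not taken from the original post):
fit <- factanal(dataset, factors = 2, scores = "regression")   # "Bartlett" is the other option
head(fit$scores)   # one row per observation, one column per factor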
Your solution of multiplying the loadings by the observation matrix is wrong. According to the FA model, the observed data should be the product of the loadings and the scores (plus the unique contributions, and then rotation). That is not equivalent to what you wrote: I think you treated the loadings as the coefficients that map observed data to scores, rather than the other way around (from scores to observations).
I found this paper that explains different ways to extract scores; it might be useful.
I have two dissimilarity matrices: one computed from observed data comparing 111 sites, and another generated using a null model.
I would like to use the adonis function in vegan to test whether the observed dissimilarities differ significantly from those expected under the null model. However, the adonis function will only take one dissimilarity matrix on the left side of the formula.
Does anyone have any idea how to model this test?
Thanks
The answer to this problem was:
meanjac <- function(x) mean(vegdist(x, method='jaccard', diag=TRUE))
test <- oecosimu(x, nestfun=meanjac, method="r1", nsimul = 10^3, statistic='adonis')
which passes a function that returns the mean of the Jaccard dissimilarity matrix to oecosimu. The 'r1' method then generates null community matrices by randomly shuffling the binary community matrix, assigning the probability of species occupancies based on their observed occupancies, and the observed statistic is compared against this null distribution.
Thanks Jari for pointing me in the right direction...
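A hedged follow-up, not part of the original answer: per vegan's documentation for oecosimu, the returned object stores the simulation results and has a print method, so the test can be inspected along these lines:
test                  # prints the observed statistic, simulated quantiles and the simulated p-value
test$oecosimu$pval    # the simulated p-value on its own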