How can I compare two matrices and test their relationship using permutations in R? - r

I'm a beginner in R, and I'm trying to apply a permutation test to compare two matrices, in order to assess whether there is a real relationship between spatial association and functional traits of forest tree species.
My first matrix is formed by indices of positive (+) and negative (-) spatial interactions between species. Positive interactions (+) were assigned the value 3 and negative interactions (-) the value 1. The second matrix contains the mean Euclidean distance of functional traits between species.
In theory, what I want to do is randomize (permute) the trait matrix (incidence matrix) against the spatial matrix without breaking (that is, while retaining) the spatial structure of association among species: one permutation for the positive interactions (+) and another for the negative ones (-).
Can anybody help me structure an R script to test this relationship?
I only have the two species x species matrices as CSV files.
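One possible reading of this request is a Mantel-style permutation test, sketched below in base R. Everything here is an assumption made for illustration: the toy matrices stand in for the two CSV files, the names spatial, trait and stat are hypothetical, and the test statistic (mean trait distance of negative pairs minus that of positive pairs) is only one of several reasonable choices. Species labels of the trait matrix are shuffled (rows and columns together), so the spatial matrix keeps its structure.

```r
set.seed(123)
n  <- 8
sp <- paste0("sp", seq_len(n))

# Hypothetical stand-ins for the two CSV files (species x species, symmetric).
spatial <- matrix(sample(c(1, 3), n * n, replace = TRUE), n, n,
                  dimnames = list(sp, sp))
spatial[lower.tri(spatial)] <- t(spatial)[lower.tri(spatial)]  # mirror upper triangle
trait <- as.matrix(dist(matrix(rnorm(n * 3), n)))
dimnames(trait) <- list(sp, sp)

# Use each species pair once (upper triangle only).
pairs <- upper.tri(spatial)

# Statistic: mean trait distance of negative pairs (1) minus positive pairs (3).
stat <- function(tr) {
  mean(tr[pairs & spatial == 1]) - mean(tr[pairs & spatial == 3])
}

obs <- stat(trait)

# Permute species labels of the trait matrix; the spatial matrix stays fixed.
perm <- replicate(999, {
  p <- sample(n)
  stat(trait[p, p])
})

# Two-sided permutation p-value (the observed value counts as one permutation).
p_value <- (sum(abs(perm) >= abs(obs)) + 1) / (length(perm) + 1)
p_value
```

With real data, the two matrices would be read with read.csv(..., row.names = 1) and converted with as.matrix(), after checking that both use the same species order.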

Related

PCoA function pcoa extract vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they differ based on the percentage of variance. That is, I want to find out a) whether there are any variables which help to draw certain observation points apart from one another and b) if so, what percentage of variance they explain.
I was advised to run a PCoA (Principal Coordinates Analysis) on my data. I ran it using the vegan and ape packages. This is my code after loading my CSV file into R (I call it data):
data.dis <- vegdist(data, method = "gower", na.rm = TRUE)
data.pcoa <- pcoa(data.dis)
I was then told to extract the vectors from the PCoA result, so I ran
data.pcoa$vectors
It returned 132 rows but 20 columns of values (from Axis 1 to Axis 20).
I was perplexed over why there were 20 columns of values when I only have 10 variables. I was under the impression that I would only get 10 columns. Could anyone explain a) what the vectors actually represent and b) how I get the percentage of variance explained by Axis 1 and Axis 2?
I also don't really understand the purpose of extracting the eigenvalues from data.pcoa. I saw some websites doing that after running a PCoA on their distance matrix, but there was no further explanation.
The Gower index is non-Euclidean, so you can expect more real axes than the number of variables, unlike in Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist(), which only accepts numeric data. Moreover, if a variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some package other than vegan for the Gower dissimilarity (there are many alternatives).
The percentage of "variance" is a bit tricky for non-Euclidean dissimilarities, which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of the data. ape::pcoa() returns the information you asked for in the element values. The proportion of variance explained is in its element values$Relative_eig. The total "variance" is returned in the element trace. All this is documented in ?pcoa, where I read it.
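To make the eigenvalue bookkeeping concrete, here is a minimal base R sketch. It deliberately swaps in stats::cmdscale() for ape::pcoa() and a hand-computed simple-matching dissimilarity (the share of variables on which two rows differ, which is what the Gower coefficient reduces to for purely nominal variables) so that it runs without extra packages. The toy data and all object names are hypothetical.

```r
set.seed(1)

# Hypothetical toy data: 12 observations, 3 categorical variables (factors).
toy <- data.frame(
  a = factor(sample(c("x", "y", "z"), 12, replace = TRUE)),
  b = factor(sample(c("lo", "hi"), 12, replace = TRUE)),
  c = factor(sample(c("p", "q", "r", "s"), 12, replace = TRUE))
)

# Simple-matching dissimilarity: share of variables on which two rows differ.
n <- nrow(toy)
m <- sapply(toy, as.character)          # n x p character matrix
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- mean(m[i, ] != m[j, ])
  }
}

# Classical PCoA; eig = TRUE returns all n eigenvalues, not just the first k.
fit <- cmdscale(as.dist(d), k = 2, eig = TRUE)
eig <- fit$eig

# Share of the trace (sum of all eigenvalues) carried by each axis.
rel_eig <- eig / sum(eig)
round(rel_eig[1:2], 3)     # axes 1 and 2
any(eig < -1e-8)           # TRUE would indicate imaginary dimensions
```

The number of axes is bounded by the number of observations (at most n - 1 non-zero eigenvalues), not by the number of variables, which is why 10 variables can yield 20 axes.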

Weighted observation frequency clustering using hclust in R

I have a large matrix of 500K observations to cluster using hierarchical clustering. Due to its size, I do not have the computing power to calculate the distance matrix.
To overcome this problem, I chose to aggregate my matrix by merging identical observations, which reduces it to about 10K observations. I have the frequency for each of the rows in this aggregated matrix. I now need to incorporate this frequency as a weight in my hierarchical clustering.
The data is a mixture of numerical and categorical variables for the 500K observations, so I have used the daisy function (cluster package) to calculate the Gower dissimilarity for my aggregated dataset. I want to use hclust from the stats package on the aggregated dataset, but I want to take into account the frequency of each observation. From the help page for hclust, the arguments are as follows:
hclust(d, method = "complete", members = NULL)
The information for the members argument is: "NULL or a vector with length size of d. See the ‘Details’ section." The Details section says: "If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means."
From the above description, I am unsure whether I can assign my frequency weights to the members argument, as it is not clear if this is the purpose of this argument. I would like to use it like this:
hclust(d, method = "complete", members = df$freq)
Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.
If anyone can help me that would be great,
Thanks
Yes, this should work fine for most linkages, in particular single, group average and complete linkage. For Ward etc., you need to correctly take the weights into account yourself.
But even that part is not hard. Just make sure to use the cluster sizes, because you need to pass the distance between two clusters, not between two points. So the matrix should contain the distance between n1 points at location x and n2 points at location y. For min/max/mean, this n disappears or cancels out. For Ward, you should get an SSQ-like (sum-of-squares) formula.
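As a rough check of this claim for group average linkage, the sketch below clusters a small toy dataset twice: once with every duplicate present, and once aggregated with the frequencies passed as members. The data, the frequencies and all object names are hypothetical, and Euclidean distance is used instead of Gower only to keep the example in base R.

```r
set.seed(42)

# Hypothetical aggregated data: 6 unique rows with their frequencies.
agg  <- matrix(rnorm(12), ncol = 2)
freq <- c(3, 1, 2, 5, 1, 4)

# Expand back to the full data (duplicates included) for comparison.
idx  <- rep(seq_len(nrow(agg)), times = freq)
full <- agg[idx, , drop = FALSE]

# Average linkage on the full data vs. on the aggregated data with members.
hc_full <- hclust(dist(full), method = "average")
hc_agg  <- hclust(dist(agg),  method = "average", members = freq)

# Compare the k-cluster partitions on the unique rows.
k <- 3
lab_full <- cutree(hc_full, k = k)[!duplicated(idx)]
lab_agg  <- cutree(hc_agg,  k = k)

# The partitions agree if the cross-table has exactly k non-empty cells.
same_partition <- sum(table(lab_full, lab_agg) > 0) == k
same_partition
```

In the full data the duplicates merge first at height 0, after which the algorithm is in the state that members describes for the aggregated run.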

k means clustering on matrix

I am trying to cluster a multidimensional functional object with the "kmeans" algorithm. By this I mean that I no longer have one vector per row (individual), but a 3x3 observation matrix per individual. For example, individual 1 has the following observations:
(x1, x2, x3),(y1,y2,y3),(z1,z2,z3).
The same structure of observations is given for the other individuals. Do you know how I can cluster with "kmeans" using all 3 observation vectors, and not only one observation vector as is normally done in "kmeans" clustering?
Would you do it for each observation vector, e.g. (x1, x2, x3), separately and then combine the information somehow? I want to do this with the kmeans() function in R.
Many thanks for your answers!
Using k-means, you interpret each observation as a point in an N-dimensional vector space. Then you minimize the distances between your observations and the cluster centers.
Since the data is viewed as points in an N-dimensional space, the actual arrangement of the values does not matter.
You can therefore either tell your k-means routine to use a matrix norm, for example the Frobenius norm, to compute the distances, or flatten your observations from 3 by 3 matrices to 1 by 9 vectors. The Frobenius norm of an NxN matrix is equal to the Euclidean norm of the corresponding 1xN^2 vector.
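The flattening approach can be sketched as follows. The data are simulated, and the storage layout (a 3 x 3 x n array called obs) is an assumption made for the example; real data may need reshaping first.

```r
set.seed(1)
n <- 20

# Hypothetical data: n individuals, each a 3 x 3 matrix, stored as a 3 x 3 x n array.
obs <- array(rnorm(3 * 3 * n), dim = c(3, 3, n))
obs[, , 11:20] <- obs[, , 11:20] + 5   # shift half of them to create two groups

# Flatten each 3 x 3 matrix into one row of length 9.
flat <- t(apply(obs, 3, as.vector))    # n x 9 matrix

# Frobenius distance between two matrices equals the Euclidean distance
# between their flattened versions.
frob <- norm(obs[, , 1] - obs[, , 2], type = "F")
eucl <- sqrt(sum((flat[1, ] - flat[2, ])^2))
all.equal(frob, eucl)

# Ordinary k-means on the flattened rows.
fit <- kmeans(flat, centers = 2, nstart = 10)
table(fit$cluster)
```

Because stats::kmeans() works with Euclidean distance, flattening gives the same result as clustering the matrices under the Frobenius norm.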
Just pass all three columns to kmeans() and it will calculate the distances in 3 dimensions, if that is what you are looking for.

Design Covariance Matrix in a simulation study in R in an efficient way

In my simulation study I need to come up with a covariance matrix for multivariate data.
My data:
dataset <- data.frame(
  observation = rep(1:8, 2),
  plot = rep(1:4, each = 2),
  time = rep(1:2, 8),
  treatment = rep(c("A", "B", "A", "B"), each = 4),
  OutputVariable = rep(c("P", "Q"), each = 8)
)
This dataset is multivariate, for every observation (1:8) there is more than one result. In this case, we observe a value for OutputVariable P and for OutputVariable Q at the same time. Note that actual outputs are not in this dataset as I will generate them at a later stage.
The desired covariance matrix would be 16x16, where CovarMat[2,9] indicates the covariance between the second line (observation 2 of variable P) and the 9th line (observation 1 of variable Q) in the dataset.
The value of, for instance, CovarMat[2,9] is based on rules like these:
CovarMat[2,9]=0
If dataset$plot[2]==dataset$plot[9] then CovarMat[2,9]=CovarMat[2,9]+1.5
If dataset$time[2]==dataset$time[9] then CovarMat[2,9]=CovarMat[2,9]+1.5
If (dataset$plot[2]==dataset$plot[9])&(dataset$time[2]==dataset$time[9]) then CovarMat[2,9]=CovarMat[2,9]+3
If abs(dataset$time[2]-dataset$time[9])==1 then CovarMat[2,9]=CovarMat[2,9]+2
Using for-loops, that's easy enough (and that's what I have done up to now). But my current dataset has 13,200 lines, and thus my CovarMat consists of 174,240,000 cells. Therefore, I am in desperate need of a more efficient way.
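One way to avoid the loops, sketched here under the assumption that the four rules above are the complete rule set, is to build each rule as a logical matrix with outer() and add the weighted matrices. The names same_plot, same_time and adj_time are hypothetical.

```r
dataset <- data.frame(
  observation = rep(1:8, 2),
  plot = rep(1:4, each = 2),
  time = rep(1:2, 8),
  treatment = rep(c("A", "B", "A", "B"), each = 4),
  OutputVariable = rep(c("P", "Q"), each = 8)
)

# One logical matrix per rule, covering all row pairs at once.
same_plot <- outer(dataset$plot, dataset$plot, "==")
same_time <- outer(dataset$time, dataset$time, "==")
adj_time  <- abs(outer(dataset$time, dataset$time, "-")) == 1

# Logical matrices are coerced to 0/1 when multiplied by the rule weights.
CovarMat <- 1.5 * same_plot +
            1.5 * same_time +
            3   * (same_plot & same_time) +
            2   * adj_time

dim(CovarMat)
CovarMat[2, 9]
```

For 13,200 rows, a numeric matrix of 174,240,000 cells needs roughly 1.4 GB, and each intermediate matrix adds to that, so memory rather than speed may become the limiting factor.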

An "asymmetric" pairwise distance matrix

Suppose there are three sequences to be compared: a, b, and c. Traditionally, the resulting 3-by-3 pairwise distance matrix is symmetric, indicating that the distance from a to b is equal to the distance from b to a.
I am wondering if TraMineR provides some way to produce an asymmetric pairwise distance matrix.
No, TraMineR does not produce 'asymmetric' dissimilarities, precisely for the reasons stressed in Pat's comment.
The main interest of computing pairwise dissimilarities between sequences is that once we have such dissimilarities we can, for instance:
measure the discrepancy among sequences, determine neighborhoods, find medoids, ...
run cluster algorithms, self-organizing maps, MDS, ...
make ANOVA-like analysis of the sequences
grow regression trees for the sequences
Inputting a non-symmetric dissimilarity matrix into those processes would most probably generate irrelevant outcomes.
It is because of this symmetry requirement that the substitution costs used for computing Optimal Matching distances MUST be symmetrical. It is important not to interpret substitution costs as the cost of switching from one state to the other, but to understand them for what they are, i.e., edit costs. When comparing two sequences, for example aabcc and aadcc, we can make them equal either by arbitrarily replacing b with d in the first one or d with b in the second one. It would therefore make no sense to assign different costs to the two substitutions.
Hope this helps.
