I have used the mixOmics package in R for a canonical correlation analysis of two matrices, and I have a resultant correlation matrix. I would like to build a correlation network from the result obtained. I had earlier thought of using the Gene Set Correlation Analysis package, but I do not know how to install it and there are no instructions available online for installing it in R (http://www.biostat.wisc.edu/~kendzior/GSCA/).
Could you also suggest other packages I could use to build networks with a correlation matrix as input? I thought of Rgraphviz but do not know if that is possible.
Copying this answer mostly from my previous answer at https://stackoverflow.com/a/7600901/567015
The qgraph package is mostly intended to visualize correlation matrices as a network. It plots variables as nodes and correlations as edges connecting the nodes. Green edges indicate positive correlations, red edges indicate negative correlations, and the wider and more saturated an edge, the stronger the absolute correlation.
For example (this is the first example from the help page), the following code will plot the correlation matrix of a 240 variable dataset.
library("qgraph")
data(big5)
data(big5groups)
qgraph(cor(big5),minimum=0.25,cut=0.4,vsize=2,groups=big5groups,legend=TRUE,borders=FALSE)
title("Big 5 correlations",line=-2,cex.main=2)
You can also place strongly correlated nodes close together (using a Fruchterman-Reingold layout), which gives quite a clear picture of what the structure of your correlation matrix actually looks like:
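As far as I know this is requested through the layout argument; a sketch based on the same big5 example as above:

# Same data as before, but nodes are placed by the Fruchterman-Reingold
# ("spring") layout so that strongly correlated variables end up close together.
qgraph(cor(big5), layout = "spring", minimum = 0.25, cut = 0.4, vsize = 2,
       groups = big5groups, legend = TRUE, borders = FALSE)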
For an extensive introduction take a look at http://www.jstatsoft.org/v48/i04/paper
You might also want to take a look at the network and sna packages on CRAN. Both include tools for converting a matrix into a network data object.
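For instance, with the network package you could threshold the correlation matrix into an adjacency matrix first. A minimal sketch, assuming your correlation matrix is called cmat and using an arbitrary 0.4 cut-off:

library(network)
adj <- (abs(cmat) > 0.4) * 1           # keep only reasonably strong correlations
diag(adj) <- 0                         # drop self-loops
net <- network(adj, directed = FALSE)  # build the network object
plot(net)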
Is there a way of calculating or estimating the area under the curve as an external metric, using base R, from confusion matrices alone?
If not, how would I do it, given the clustering object?
e.g. we can start from
cutree(hclust(dist(iris[, 1:4]), method = "average"), 3)
or, from a diagonal-maximized version of
table(iris$Species, cutree(hclust(dist(iris[, 1:4]), method = "average"), 3))
the latter being the confusion matrix. I would much, much prefer a solution that goes from the confusion matrix but if it's impossible we can use the clustering object itself.
I read the comments here: Calculate AUC in R? -- the top solution looks good, but it's unclear to me how to generalise it for multi-class data like iris.
(No packages, obviously, I want to find out how to do it by hand in base R)
I'm working with a set of co-ordinates, and want to dynamically (I have many sets that need to go through this process) understand how many distinct groups there are within the data. My approach was to apply k-means to investigate whether it would find the centroids and I could go from there.
When plotting some data with 6 distinct clusters (visually) the k-means algorithm continues to ignore two significant clusters while putting many centroids into another.
See image below:
Red points are the coordinate data and blue points are the centroids that k-means has provided. In this specific case I've chosen k = 15 (arbitrarily), but it still doesn't recognise those patches of data on the right-hand side, instead placing a single centroid between them while putting 8 centroids in the cluster in the top right.
Admittedly there are slightly more data points in the top right, but not by much.
I'm using the standard k-means algorithm in R and just feeding in x and y co-ordinates. I've tried standardising the data, but this doesn't make any difference.
Any thoughts on why this is, or other potential methodologies that could be applied to try and dynamically understand the number of distinct clusters there are in the data?
You could try a self-organizing map (SOM):
this is a clustering algorithm based on neural networks that creates a discretized representation of the input space of the training samples, called a map, and is therefore also a dimensionality-reduction method.
The algorithm is well suited to clustering because it does not require choosing the number of clusters a priori (in k-means you need to choose k; here you do not). In your case it may find a sensible number of clusters automatically, and you can actually visualize the result.
There is a very nice Python package called somoclu that implements this algorithm and makes it easy to visualize the result. Alternatively, you can go with R: there are blog-post tutorials and a CRAN package manual for SOM.
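A minimal sketch in R with the kohonen package (the package choice, the grid size, and the name xy for your two-column coordinate matrix are assumptions on my part):

library(kohonen)
# Train a 6 x 6 hexagonal SOM on the standardized coordinates.
fit <- som(scale(xy), grid = somgrid(xdim = 6, ydim = 6, topo = "hexagonal"))
plot(fit, type = "mapping")   # where each observation lands on the map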
K-means is a randomized algorithm and can get stuck in local minima.
Because of this, it is common to run k-means several times and keep the result with the smallest within-cluster sum of squares, i.e. the best of the local minima found.
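A minimal sketch in base R, assuming your coordinates are in a two-column matrix xy and you want 6 clusters: the nstart argument re-runs k-means from several random starts and keeps the best solution.

# Run k-means from 25 random starts and keep the best result.
fit <- kmeans(xy, centers = 6, nstart = 25)
plot(xy, col = fit$cluster)
points(fit$centers, col = "blue", pch = 8)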
I was initially trying to reproduce PCA plots shown in this paper (Figure 1).
The paper uses PCA to visualize protein structure conformations in a lower dimension, following reference 16 (Figure 1, B and C). Each point in the PC plots represents a protein structure in a lower-dimensional space. But I have some doubts now that I am trying to reproduce these plots. So I looked at this link, an R library called bio3d from the authors of reference 16. Each pdb file contains {X Y Z} coordinate positions for its atoms. After aligning the regions among proteins you take these data for PCA. I am trying to reproduce the results shown on the bio3d example page, but using MATLAB (since I am not familiar with R). However, I am unable to get the plot shown in Figure 9 of the bio3d link.
Can someone help me reproduce these figures? My MATLAB script and the 6 structures prepared as in the webpage are uploaded here. The script only loads the data so far, although I have made some attempts from my side.
UPDATE 1 : In short, my question is:
Can someone advice me how to prepare the covariance matrix from the 6 structures with their coordinates for this particular problem, so that I can do PCA on it?
UPDATE 2 : I initially shared non-aligned pdb structure files in the Google Drive by mistake. I have now uploaded the correct files.
Quoting from the question:
After aligning the regions among proteins you take these data for PCA. (Emphasis added).
You do not seem to have aligned the regions among the proteins first.
This application of PCA to protein structures starts with a set of similar proteins whose 3-dimensional structures have been determined, perhaps under different conditions of biological interest. For example, the proteins may have been bound to specific small molecules that regulate their structure and function. The idea is that most of the structure of these proteins will agree closely under these different conditions, while the portions of the proteins that are most important for function will be different. Those most important portions of the proteins thus may show variance in 3-dimensional positions among the set of structures, and clusters in principal components (as in part C of the first figure in this question) illustrate which particular combinations of proteins and experimental conditions are similar to each other in terms of these differences in 3-dimensional structure.
The {X,Y,Z} coordinates of the atoms in the proteins, however, may have different systematic orientations in space among the set of protein structures, as the coordinate system in any one case is based on details of the x-ray crystallography or other methods used to determine the structures. So the first step is to rotate the individual protein structures so that all protein structures align as closely as possible to start. Then variances are calculated around those closely aligned (after rotation) 3-dimensional structures. Otherwise, most of the variance in {X,Y,Z} space will represent the differences in systematic orientation among the crystallography sessions.
As with all R packages, bio3d has publicly available source code. The pdbfit() function includes two important pre-processing steps before PCA: it accounts for gaps in the structures with a gap.inspect() function, and then rotates the protein structures in 3 dimensions for the best overall alignment with a fit.xyz() function. Only then does it proceed to PCA.
You certainly could try to reproduce those pre-processing functionalities in MATLAB, but in this case it might be simplest to learn enough R to take advantage of what is already provided in this extensive package.
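A rough sketch of that workflow in R, assuming the six PDB files sit in a directory pdbs/ (the paths are illustrative, and the exact function names are worth double-checking against the bio3d manual):

library(bio3d)
files <- list.files("pdbs", pattern = "\\.pdb$", full.names = TRUE)
pdbs  <- pdbaln(files)           # align the structures
xyz   <- pdbfit(pdbs)            # superimpose (rotate/translate) the coordinates
gaps  <- gap.inspect(pdbs$xyz)   # find coordinate columns without gaps
pc    <- pca.xyz(xyz[, gaps$f.inds])
plot(pc)                         # PC plots analogous to Figure 9 of the tutorial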
I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consists of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option to find the distances for these points is to use the gower similarity method within the daisy function. 1) Is that the best method?
Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to give me the strength of each cluster. However, I have not been able to modify the code to accept the distance measurements previously made using daisy. I have tried, unsuccessfully, to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 and to use the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R
2) Can anyone help me modify the existing code to accept my distance measurements?
3) Or, is there another better way to determine the number of significant clusters?
I thank all in advance for your help.
Some comments...
About 1)
It is a good way to deal with different types of data.
You could also create as many new columns in the dataset as there are possible nominal values and put 1/0 where needed. For example, if there are 3 nominal values such as "reptile", "mammal" and "bird", you could change your initial dataset with 2 columns (numeric, nominal)
into one with 4 columns (numeric, numeric representing reptile, numeric representing mammal, numeric representing bird); an instance (23.4, "mammal") would be mapped to (23.4, 0, 1, 0).
Using this mapping you could work with "normal" distances (be sure to standardize the data so that no column dominates the others due to its large or small values).
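A minimal sketch of that dummy coding, assuming a data frame traits with a numeric column mass and a factor column class (the names are illustrative):

# Expand the factor into one 0/1 indicator column per level, then standardize.
X <- cbind(traits$mass, model.matrix(~ class - 1, data = traits))
X <- scale(X)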
About 2)
daisy returns an object of class dissimilarity, which you can pass to other clustering algorithms from the cluster package (so you may not need to implement anything extra). For example, the function pam can take the object returned by daisy directly, as sketched below.
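A minimal sketch, assuming your traits are in a data frame traits with factors for the nominal variables and that 4 clusters are wanted:

library(cluster)
d   <- daisy(traits, metric = "gower")  # Gower dissimilarity for mixed data
fit <- pam(d, k = 4, diss = TRUE)       # k-medoids on the precomputed dissimilarities
fit$clustering                          # cluster assignment for each species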
About 3)
Clusters are really subjective and most clustering algorithms depend on the initial conditions, so "significant clusters" is a term many people would not be comfortable using. pam could be useful in your case because clusters are centred on medoids, which works well for nominal data (because it is interpretable). K-means, for example, has the disadvantage that the centroids are not interpretable (what does 1/2 reptile, 1/2 mammal mean?); pam builds clusters centred on actual instances, which is nice for interpretation purposes.
About pam:
http://en.wikipedia.org/wiki/K-medoids
http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
You can use Zahn's algorithm to find the clusters. Basically it builds a minimum spanning tree and then removes the longest edges.
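A rough sketch of the idea using igraph rather than a dedicated implementation (assumes d is a dissimilarity matrix such as the one from daisy above, and that 3 clusters are wanted):

library(igraph)
g  <- graph_from_adjacency_matrix(as.matrix(d), mode = "undirected", weighted = TRUE)
t  <- mst(g)                     # minimum spanning tree over the dissimilarities
k  <- 3
t2 <- delete_edges(t, order(E(t)$weight, decreasing = TRUE)[seq_len(k - 1)])
components(t2)$membership        # cluster label for each observation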
I am using the bnlearn package in R, which learns Bayesian networks from data. I am trying to get more connections between the data nodes, and hence I am trying to decrease the weight threshold necessary to generate arcs between the nodes. I am using the gs function in the bnlearn package, which uses a grow-shrink algorithm. So far I have tried modifying the alpha threshold, but that appears to change the error threshold instead.
Ultimately, my goal is to have the algorithm create more arcs between the points.
Thanks
You might need to first compute the weight of all the arcs and selectively filter them yourself; I don't think bnlearn has that built in as a single option. A rough sketch follows.
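A hedged sketch, assuming your data are in a data frame df (double-check the function names against the bnlearn manual):

library(bnlearn)
net <- gs(df)                         # grow-shrink structure learning
w   <- arc.strength(net, data = df)   # strength (test p-value) of each learned arc
w[w$strength < 0.05, ]                # keep/inspect only the stronger arcs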