I'm working on comparing bacterial metabolic models. Each model has a set of metabolites and their concentrations at 200 time points. I'm in the process of comparing the models in order to cluster them by similarity.
One method I followed was a pairwise comparison of each metabolite pair in two models using Euclidean distance. Below is what my data look like. This is a sample data file.
I computed the pairwise Euclidean distance between Met1 from Model A and Met1 from Model B, likewise computed the distances for all the common metabolites between the two models (e.g. Met4 in Model A and Met4 in Model B), and summed up the distances to get a distance (dissimilarity) between the two models. Similarly, I computed the dissimilarity matrix for all the models and used hierarchical clustering to cluster them.
Now I want to compute the dissimilarity of the models using the Discrete Wavelet Transform as my distance measure. However, I couldn't find a method in the package documentation for comparing two time series. I would like to know how to use the Discrete Wavelet Transform to compute a dissimilarity between two time series, and hence between my models.
Take a look at the TSclust package. Here is how you would apply it to your sample data.
require(TSclust)
#read in the data
model_a <- read.csv("~/Desktop/Model A.csv", header = TRUE, stringsAsFactors = FALSE)
model_b <- read.csv("~/Desktop/Model B.csv", header = TRUE, stringsAsFactors = FALSE)
#data must be in rows rather than columns
model_a <- as.data.frame(t(model_a))
model_b <- as.data.frame(t(model_b))
#calculate dissimilarities between metabolites in Models A and B
met1_DWT.diss <- as.numeric(diss.DWT(rbind(model_a['Met1', ], model_b['Met1', ])))
met1_DWT.diss
[1] 90.80332
met2_DWT.diss <- as.numeric(diss.DWT(rbind(model_a['Met2', ], model_b['Met2', ])))
met2_DWT.diss
[1] 1.499241
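To go from per-metabolite distances to a model-level dissimilarity matrix, you can sum over the shared metabolites exactly as you did with Euclidean distance. Here is a rough sketch (the model_diss helper and the models list are my own illustrative names, not part of TSclust):
#sketch: sum per-metabolite DWT dissimilarities over shared metabolites
model_diss <- function(m1, m2) {
  common <- intersect(rownames(m1), rownames(m2))
  sum(sapply(common, function(met) {
    as.numeric(diss.DWT(rbind(m1[met, ], m2[met, ])))
  }))
}
#'models' is a named list of the transposed model data frames
models <- list(A = model_a, B = model_b)
n <- length(models)
D <- matrix(0, n, n, dimnames = list(names(models), names(models)))
for (i in seq_len(n - 1)) for (j in (i + 1):n) {
  D[i, j] <- D[j, i] <- model_diss(models[[i]], models[[j]])
}
#hierarchical clustering on the model-level dissimilarity matrix
plot(hclust(as.dist(D)))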
Related
I created a phylogenetic NJ tree in R using the ape package. My data contain metric measures from multiple individuals belonging to known groups, so I decided to calculate the Mahalanobis distance between these groups in order to incorporate the covariance structure into my analyses. Creating the NJ tree was therefore not a problem.
require(ape)
require(MASS) #for lda()
#discriminant analysis, then distances between the group means in LD space
fit <- lda(y, grouping = as.factor(ynames))
d <- dist(predict(fit, fit$means)$x, upper = TRUE, diag = TRUE)
plot(nj(d))
However, now I'd like to calculate bootstrap support values for the branch splits. I'd use the boot.phylo function, but I have no idea how to deal with the FUN argument, and thus with the correct calculation of Mahalanobis distances for each bootstrapped data set.
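An untested sketch of how FUN might be wired up, assuming y is the individuals-by-variables matrix and ynames the group labels from above. Note that boot.phylo resamples the columns of y (the metric variables) and passes the resampled matrix to FUN; lda() may warn about collinear variables on some resamples.
#untested sketch: FUN rebuilds the NJ tree from a resampled copy of 'y'
build_tree <- function(x) {
  f <- lda(x, grouping = as.factor(ynames))
  nj(dist(predict(f, f$means)$x))
}
tr <- build_tree(y)
bp <- boot.phylo(tr, y, FUN = build_tree, B = 100) #bootstrap support values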
I wish to use the function gls in the R package nlme to analyse a set of nested spatial samples, in which many samples overlap in at least some spatial coordinates. I want to account for non-independence in the response variable (the thing I'm measuring in each spatial sample) using either a corStruct or pdMat object, but I'm confused about how to do this.
I have generated a covariance matrix that should encode all the information about non-independence between spatial samples. Each row/column is a distinct spatial sample, the diagonal contains the total number of sampling units captured by each spatial sample, and the off-diagonal elements contain counts of sampling units shared between spatial samples.
I think I should use the nlme function gls while specifying a correlation structure, possibly using a corSymm or pdMat object. But I've only seen examples where the correlation structure in gls is specified via a formula. How can I use the covariance matrix that I've created?
I discovered that you can pass the nlme function gls a positive-definite correlation matrix by using the general correlation structure provided by corSymm.
library(nlme)
library(Matrix) #for nearPD()
# convert your variance-covariance matrix into a correlation matrix
CM <- cov2cor(vcv_matrix)
# if your correlation matrix contains zeros, as mine did, you need to convert it
# to a positive-definite matrix that substitutes very small numbers for those
# zeros (corr = TRUE keeps the unit diagonal)
CM <- as.matrix(nearPD(CM, corr = TRUE)$mat)
# convert into a corStruct object using the general correlation structure
# provided by corSymm, passing the lower-triangular values as fixed
C <- corSymm(CM[lower.tri(CM)], fixed = TRUE)
# the correlation structure can now be included in a gls model
gls(y ~ x, correlation = C, method = "ML")
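A self-contained toy run of the recipe above, with simulated data standing in for the real vcv_matrix and variables (every name here is hypothetical):
library(nlme)
library(Matrix)
set.seed(1)
n <- 20
vcv_matrix <- crossprod(matrix(rnorm(n * n), n)) #stand-in for the overlap matrix
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)
CM <- cov2cor(vcv_matrix)
CM <- as.matrix(nearPD(CM, corr = TRUE)$mat)
C <- corSymm(CM[lower.tri(CM)], fixed = TRUE)
summary(gls(y ~ x, data = dat, correlation = C, method = "ML"))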
I am trying to create a ROC curve from the code below. I get an error that states:
Error in prediction(bc_rf_predict_prob, bc_test$Class) :
  Number of cross-validation runs must be equal for predictions and labels.
library(mlbench) #has the Breast Cancer dataset in it
library(caret)
data(BreastCancer) #two class model
bc_changed<-BreastCancer[2:11] #removes variables not to be used
#Create train and test/holdout samples (works fine)
set.seed(59)
bc_rand <- sample(1:699, 499) #sample 499 of the 699 observations for training
bc_train <- bc_changed[ bc_rand,]
bc_test <- bc_changed[-bc_rand,]
#random forest decision tree (works fine)
library(randomForest)
set.seed(59)
bc_rf <- randomForest(Class ~ ., data = bc_train, ntree = 500, na.action = na.omit, importance = TRUE)
#ROC
library(ROCR)
actual <- bc_test$Class
bc_rf_predict_prob <- predict(bc_rf, bc_test, type = "prob")
bc.pred <- prediction(bc_rf_predict_prob, bc_test$Class) #does not work - error below
Error in prediction(bc_rf_predict_prob, bc_test$Class) :
  Number of cross-validation runs must be equal for predictions and labels.
I think it comes from the fact that when I do:
bc_rf_predict_prob <- predict(bc_rf, bc_test, type = "prob")
I get a matrix as the result: one column of probabilities for Benign and a second column of probabilities for Malignant. My logic tells me I should have only a single vector of probabilities.
According to page 9 of the ROCR Library documentation, the prediction function has two required inputs, predictions and labels, which must have the same dimensions.
In the case of a matrix or data frame, all cross-validation runs must have the same length.
Since str(bc_rf_predict_prob) shows a matrix [1:200, 1:2], str(bc_test$Class) should have a matching dimension.
It sounds like you want just one column of bc_rf_predict_prob (the probabilities of a single class), but I can't be certain without looking at the data.
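If that is the case, a minimal sketch of the fix is to pass a single vector of class probabilities to prediction(); I'm assuming here that "malignant" (the second column, named after the factor level) is the positive class:
#pass only the positive-class probabilities to prediction()
bc_rf_predict_prob <- predict(bc_rf, bc_test, type = "prob")
bc.pred <- prediction(bc_rf_predict_prob[, "malignant"], bc_test$Class)
bc.perf <- performance(bc.pred, "tpr", "fpr") #ROC curve coordinates
plot(bc.perf)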
What are some proven methods for finding groupings of highly correlated variables within a large, high-dimensional binary dataset (think 200,000+ rows and 150+ fields) that can be easily implemented in R? I want groupings of variables that lend themselves to interpretation, so I don't think PCA would be the best method.
library(Hmisc)
mtc <- mtcars[, 2:8]
mtcn <- data.matrix(mtc)
clust <- varclus(mtcn) #hierarchical clustering of the variables
clust
plot(clust)
From ?varclus: "Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction."
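Since your dataset is binary, the "proportion both positive" similarity mentioned in that description may suit it better than the default; a sketch, where bin_mat is a stand-in name for your 200,000 x 150 binary matrix:
#variable clustering with the proportion-both-positive similarity
clust_bin <- varclus(bin_mat, similarity = "bothpos")
plot(clust_bin)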
For binary variables:
library(cluster)
data(animals)
ma <- mona(animals)
ma
plot(ma)
From ?mona: "Returns a list representing a divisive hierarchical clustering of a dataset with binary variables only."
I would like to compare the behaviour of several dissimilarity measures (e.g. Bray-Curtis, Jaccard, Gower). I have seen this done using a principal component biplot (see Legendre and De Cáceres, 2013, below):
Any suggestions on how one goes about this? Sample data are provided below:
# Load the required packages
library(ade4)
library(vegan)
library(FD)
#Load data
data(dune)
# Calculate a series of dissimilarity measures for the data
dune.bc <- vegdist(dune, method="bray")
dune.mh <- vegdist(dune, method="manhattan")
dune.eu <- vegdist(dune, method="euclidean")
dune.cn <- vegdist(dune, method="canberra")
dune.k <- vegdist(dune, method="kulczynski")
dune.j <- vegdist(dune, method="jaccard")
dune.g <- vegdist(dune, method="gower")
dune.m <- vegdist(dune, method="morisita")
dune.h <- vegdist(dune, method="horn")
dune.mf <- vegdist(dune, method="mountford")
dune.r <- vegdist(dune, method="raup")
dune.bi <- vegdist(dune, method="binomial")
dune.c <- vegdist(dune, method="chao")
#Compare the behaviour of the dissimilarity measures using a PCA plot
# Suggestions on how to proceed with this step would be greatly appreciated!
Hmm, that's not what the authors do. If you read that paper, the PCA biplot is of the matrix of properties of each dissimilarity coefficient, not a PCA of the k dissimilarity matrices. Basically, they analysed Table 2 of the paper via PCA (minus the column at the far right, labelled *D*max).
I don't know of a way to compare dissimilarity matrices directly, other than via a Procrustes rotation and the associated PROTEST permutation test, or perhaps a Mantel test: see procrustes(), protest() and mantel() in vegan.
You can also look at the rankindex() of the coefficients against the gradient values as another comparison.
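For example, with the matrices already computed above (protest() here compares two-dimensional principal-coordinate configurations from cmdscale(), and rankindex() is fed the A1 soil variable from dune.env; both choices are just illustrations):
protest(cmdscale(dune.bc), cmdscale(dune.j)) #Procrustes rotation + PROTEST test
mantel(dune.bc, dune.j)                      #Mantel test on the two matrices
data(dune.env)
rankindex(dune.env$A1, dune)                 #rank correlations of indices with a gradient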
It sounds like what you are trying to do is a second-stage analysis.
Take several dissimilarity matrices and generate pairwise rank correlations between all of them; this creates a dissimilarity matrix of your dissimilarity matrices. From there you can use NMDS to plot them all. In general you'll find that similar coefficients (the Euclidean family, the Bray-Curtis family, etc.) group closely together; a rough sketch is given after the references below.
Check out:
Clarke, Somerfield, Airoldi and Warwick (2006). Exploring interactions by second-stage community analyses.
Here they do what you suggest, or want:
Clarke, Somerfield and Chapman (2006). On resemblance measures for ecological studies, including taxonomic dissimilarities and a zero-adjusted Bray-Curtis coefficient for denuded assemblages.
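Below is a rough sketch of that second-stage recipe, reusing the dune dissimilarity matrices computed in the question; the choice of matrices and the 1 - rho conversion are my own, not taken from Clarke et al.:
#pairwise Spearman rank correlations between the unfolded dissimilarity matrices
d_list <- list(bray = dune.bc, jaccard = dune.j, gower = dune.g,
               euclidean = dune.eu, canberra = dune.cn)
k <- length(d_list)
rho <- matrix(1, k, k, dimnames = list(names(d_list), names(d_list)))
for (i in 1:(k - 1)) for (j in (i + 1):k) {
  rho[i, j] <- rho[j, i] <- cor(as.vector(d_list[[i]]), as.vector(d_list[[j]]),
                                method = "spearman")
}
#a dissimilarity matrix of the dissimilarity matrices, then NMDS
second_stage <- as.dist(1 - rho)
ord <- metaMDS(second_stage)
plot(ord, display = "sites", type = "t")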