Clustering Variables in R

What are some proven methods for finding groupings of highly correlated variables within a large, high-dimensional binary dataset (think 200,000+ rows and 150+ fields) that can be easily implemented in R? I want to find groupings of variables that lend themselves to interpretation, so I don't think PCA would be the best method.

library(Hmisc)
# cluster the variables of a small example dataset
mtc <- mtcars[, 2:8]
mtcn <- data.matrix(mtc)
clust <- varclus(mtcn)
clust
plot(clust)
From ?varclus: Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or the proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.
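Since the question is about binary data, note that varclus can take the "both positive" similarity directly. A minimal sketch, assuming the 200,000 x 150 dataset lives in a 0/1 data frame called bin_df (a hypothetical name):
library(Hmisc)
# similarity = "bothpos" clusters on the proportion of observations
# for which both variables equal 1, which suits binary fields
clust_bin <- varclus(data.matrix(bin_df), similarity = "bothpos")
plot(clust_bin)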
For binary variables:
library(cluster)
# built-in example dataset of binary animal attributes
data(animals)
ma <- mona(animals)
ma
plot(ma)
From ?mona: Returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.
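One caveat: mona clusters the rows (observations), not the columns. To group binary variables with it, one option, again a sketch using the hypothetical 0/1 data frame bin_df, is to transpose so each variable becomes a row; note that a transposed matrix with 200,000 columns may be demanding at this scale:
library(cluster)
# after t(), each row is one variable; transposing a 0/1 matrix keeps it binary
ma_vars <- mona(t(data.matrix(bin_df)))
plot(ma_vars)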

Related

Multiple weight matrices in an SLX model in R

Are there any packages or commands that allow multiple weight matrices in a spatial lagged X (SLX) model?
I want to include two different weight matrices with one dependent variable, but I cannot find any package for it.
Theoretically, in spatial analysis, is including multiple W matrices appropriate? If it is possible, how can I conduct the analysis with W1 and W2? Do I have to do it by hand? (I mean: first create the lagged variables by multiplying each W matrix by the key explanatory variable, and then run an OLS regression with those variables. Is that the right way to apply multiple weight matrices?)
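In the absence of a dedicated function, the by-hand route described in the question is workable. A minimal sketch, assuming W1 and W2 are listw objects built beforehand (e.g., with spdep::nb2listw()) and df holds the variables; all object names here are hypothetical:
library(spdep)
# spatial lag of the explanatory variable X under each weight matrix
df$lagX_W1 <- lag.listw(W1, df$X)  # equivalent to W1 %*% X for row-standardised W1
df$lagX_W2 <- lag.listw(W2, df$X)
# SLX with two weight matrices, estimated by OLS
fit <- lm(y ~ X + lagX_W1 + lagX_W2, data = df)
summary(fit)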

Bootstrap for phylogenetic tree generated using Mahalanobis distance (R)

I created a phylogenetic NJ tree in R using the ape package. My data contain metric measurements from multiple individuals belonging to known groups, so I decided to calculate the Mahalanobis distance between these groups in order to incorporate the covariance structure into my analyses. Creating the NJ tree was therefore not a problem.
require(ape)
require(MASS)  # for lda()
# y: matrix of metric measurements; ynames: group labels for the rows of y
fit <- lda(y, grouping = as.factor(ynames))
# Euclidean distances between group means in discriminant space,
# which correspond to Mahalanobis-type distances between the groups
d <- dist(as.matrix(predict(fit, fit$means)$x), upper = TRUE, diag = TRUE)
plot(nj(d))
However, now I'd like to calculate bootstrap values for the branch splits. I would use the boot.phylo function, but I have no idea how to set up its FUN argument, and thus how to correctly recalculate the Mahalanobis distances for each bootstrapped dataset.
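One way to approach this, sketched as an assumption rather than a tested recipe: wrap the whole tree-building pipeline in a function and pass it as FUN. boot.phylo resamples the columns (variables) of the supplied matrix, so the LDA and the distances get recomputed on every resample (y and ynames as in the code above):
library(ape)
library(MASS)
build_tree <- function(mat) {
  # resampled columns can be collinear, so lda() may emit warnings
  fit <- lda(mat, grouping = as.factor(ynames))
  d <- dist(as.matrix(predict(fit, fit$means)$x))
  nj(d)
}
tree <- build_tree(y)
# bootstrap support values for the internal branches of the tree
bp <- boot.phylo(tree, y, FUN = build_tree, B = 100)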

How can I treat variables symmetrically or jointly in a random forest model in R?

I am trying to model correlations of stock returns.
The correlations are assumed to be functions of firm variables.
$\rho = f(w, [(X_1, X_2), (Y_1, Y_2)])$
where $w$ is a vector of parameters and $X$ and $Y$ are firm characteristics. The subscripts 1 and 2 index the two firms and are arbitrary; you could reverse them.
For a gam I could use (for instance):
library(mgcv)
corr.gam <- gam(corr ~ s(X1, X2) + s(Y1, Y2), data = df)
This captures the idea that it is the joint relationship between X1 and X2 that is predicted to relate to the correlation. For instance, if X is the market value of the firm, two big firms might be more correlated than two small firms.
Is there a way to tell R that the splits in the trees it builds should be based on relationships between variables? Or is there a different way to formulate the problem entirely?
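Standard random forest implementations in R have no equivalent of gam's joint smooths, but one common workaround, a sketch that is my suggestion rather than anything from the original post, is to feed the forest symmetric functions of each pair, so that swapping firms 1 and 2 leaves every feature unchanged:
library(randomForest)
# order-invariant encodings of each pair
df$X_min  <- pmin(df$X1, df$X2)
df$X_max  <- pmax(df$X1, df$X2)
df$X_diff <- abs(df$X1 - df$X2)
df$Y_min  <- pmin(df$Y1, df$Y2)
df$Y_max  <- pmax(df$Y1, df$Y2)
df$Y_diff <- abs(df$Y1 - df$Y2)
corr.rf <- randomForest(corr ~ X_min + X_max + X_diff + Y_min + Y_max + Y_diff,
                        data = df)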

Can I use a covariance matrix to specify the correlation structure in the nlme function gls?

I wish to use the function gls in the R package nlme to analyse a set of nested spatial samples, in which many samples overlap in at least some spatial coordinates. I want to account for non-independence in the response variable (the thing I'm measuring in each spatial sample) using either a corStruct or pdMat object, but I'm confused about how to do this.
I have generated a covariance matrix that should encode all the information about non-independence between spatial samples. Each row/column is a distinct spatial sample, the diagonal contains the total number of sampling units captured by each spatial sample, and the off-diagonal elements contain counts of sampling units shared between spatial samples.
I think I should use the nlme function gls while specifying a correlation structure, possibly using a corSymm or pdMat object. But I've only seen examples where the correlation structure in gls is specified via a formula. How can I use the covariance matrix that I've created?
I discovered that you can pass the nlme function gls a positive-definite correlation matrix by using the general correlation structure provided by corSymm.
library(nlme)    # gls() and corSymm()
library(Matrix)  # nearPD()
# convert your variance-covariance matrix into a correlation matrix
CM <- cov2cor(vcv_matrix)
# if your correlation matrix contains zeros, as mine did, you need to convert
# it to a positive-definite matrix that substitutes very small numbers for those zeros
CM <- as.matrix(nearPD(CM)$mat)
# convert into a corStruct object using the general correlation structure provided by corSymm
C <- corSymm(CM[lower.tri(CM)], fixed = TRUE)
# the correlation structure can now be included in a gls model
gls(y ~ x, correlation = C, method = "ML")
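One caveat worth adding (my observation, not part of the original answer): corSymm matches the fixed correlations to observations by position, so the rows of the data passed to gls must be in the same order as the rows and columns of CM.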

R: Comparing dissimilarity between metabolic models with Discrete Wavelet Transformation

I'm working on comparing bacterial metabolic models. Each model has a set of metabolites and their concentrations at 200 time points. I'm in the process of comparing the models in order to cluster them based on their similarity.
One method I followed was a pairwise comparison of each pair of metabolites in two models using Euclidean distance. Below is what my data look like. This is a sample data file.
I computed the pairwise Euclidean distance between Met1 from Model A and Met1 from Model B, likewise computed the distances for all the common metabolites between the two models (e.g., Met4 in Model A and Met4 in Model B), and summed the distances to get a single dissimilarity between the two models. I computed the dissimilarity matrix for all the models in the same way and used hierarchical clustering to cluster them.
Now I want to compute the dissimilarity of the models using the Discrete Wavelet Transform as my distance measure. However, I couldn't find a method in the package documentation for comparing two time series. I would like to know how to use the Discrete Wavelet Transform to compute a dissimilarity between two time series, and hence between my models.
Take a look at the TSclust package. Here is how you would apply it to your sample data.
require(TSclust)
# read in the data
model_a <- read.csv("~/Desktop/Model A.csv", header = TRUE, stringsAsFactors = FALSE)
model_b <- read.csv("~/Desktop/Model B.csv", header = TRUE, stringsAsFactors = FALSE)
# the series must be in rows rather than columns
model_a <- as.data.frame(t(model_a))
model_b <- as.data.frame(t(model_b))
# calculate dissimilarities between matching metabolites in models A and B
met1_DWT.diss <- as.numeric(diss.DWT(rbind(model_a['Met1', ], model_b['Met1', ])))
met1_DWT.diss
[1] 90.80332
met2_DWT.diss <- as.numeric(diss.DWT(rbind(model_a['Met2', ], model_b['Met2', ])))
met2_DWT.diss
[1] 1.499241
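To go from per-metabolite values to the model-level dissimilarity described in the question, the summation used for the Euclidean version carries over. A sketch, assuming the metabolite names are the row names after transposition:
# sum the DWT dissimilarities over all metabolites common to both models
common_mets <- intersect(rownames(model_a), rownames(model_b))
model_diss <- sum(sapply(common_mets, function(m) {
  as.numeric(diss.DWT(rbind(model_a[m, ], model_b[m, ])))
}))
model_diss
Repeating this for every pair of models fills the dissimilarity matrix, which can then go into hclust(as.dist(...)) as before.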
