I am using the R package mclust to separate data into clusters. For this, I am using a one-dimensional model that allows for variable variances of the normal distributions underlying the clustering (the "V" model in the package).
The function call looks like this: Mclust(dataToCluster, G=possibleClusters, modelNames=c("V")). To define the possible numbers of clusters, I use a vector possibleClusters, e.g. 1:4 to allow for one to four clusters.
As a result of the clustering, after automatic model selection by Mclust using the BIC, I get a result with parameters of a normal distribution. For a model with three clusters, it might look like this:
# output shortened and commented for better readability
> result$parameters
# proportion of data points per cluster ("lambda")
$pro
[1] 0.3459566 0.3877521 0.2662913
# mean of normal distribution per cluster ("mu")
$mean
1 2 3
110.3197 204.0477 265.0929
# variances per cluster ("sigma sq")
$variance$sigmasq
[1] 342.5032 128.4648 254.9257
However, I do have some knowledge about what these parameters are supposed to look like a priori. For example, I might know that:
1) sigmasq must be between 100 and 1000 units
2) the mean values of adjacent clusters must be at least 40 units apart
3) if there are three clusters, the mean value of the third cluster must be at least 215 units
Here is a graphical example for possible results of the clustering (the x axis corresponds to the units of mean and sigma unsquared):
Taking into account the constraints given above, example plots A1 (according to rules 1 and 2) and B1 (according to rules 2 and 3) can't be correct. Instead, the results should look more like A2 and B2, which were produced using slightly different data. Note that, taking into account these constraints, the “best” number of clusters might change (A1 vs. A2).
I would like to know how to include this kind of a priori information when using the Mclust function. The function does have a parameter prior, which might allow for this but I wasn't able to figure out how this could work. How could I bring the constraints into the function?
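For illustration, the three rules can at least be written down as a check on a fitted parameter set. This is only a sketch: check_constraints is a hypothetical helper (not part of mclust), and it does not make Mclust itself respect the constraints. It could be used to discard candidate fits (one per value of G) that violate the rules before comparing their BIC values.

```r
# Hypothetical helper, not part of mclust: checks the three a priori rules
# against the means and variances of one fitted model.
check_constraints <- function(means, sigmasq,
                              var_range = c(100, 1000),
                              min_gap = 40,
                              min_third_mean = 215) {
  means <- sort(unname(means))
  # Rule 1: every variance lies inside the allowed range
  rule1 <- all(sigmasq >= var_range[1] & sigmasq <= var_range[2])
  # Rule 2: adjacent means are at least min_gap apart
  rule2 <- all(diff(means) >= min_gap)
  # Rule 3: only applies to three-cluster solutions
  rule3 <- length(means) != 3 || means[3] >= min_third_mean
  c(rule1 = rule1, rule2 = rule2, rule3 = rule3)
}

# Parameters copied from the three-cluster result shown above
means   <- c(110.3197, 204.0477, 265.0929)
sigmasq <- c(342.5032, 128.4648, 254.9257)
ok <- check_constraints(means, sigmasq)
print(ok)
```

With a fitted object, the inputs would be result$parameters$mean and result$parameters$variance$sigmasq.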
I am struggling with the loess function in R.
I have a data frame on which I would like to perform locally weighted polynomial regression.
Each ‘Gene’ has an associated Count (log10 transformed), which gives information about the gene's expression. Each Gene also has an associated ‘Integrity’ measurement (range 0-100), which indicates the quality of the ‘Count’ measurement for that gene. As a general principle, the higher the ‘Integrity’, the more reliable the ‘Count’ for that Gene.
Below is a chunk of the data frame.
Sample data frame:
Gene | Integrity | Count
ENSG00000198786.2 | 96.6937 | 3.55279
ENSG00000210194.1 | 96.68682 | 1.39794
ENSG00000212907.2 | 94.81709 | 2.396199
ENSG00000198886.2 | 93.87207 | 3.61595
ENSG00000198727.2 | 89.08319 | 3.238548
ENSG00000198804.2 | 88.82048 | 3.78326
I would like to use loess to predict the ‘true’ value of genes with low ‘Integrity’ values (since these are less reliable).
I) Should I pre-process my data frame in order to apply loess correctly? In a plethora of examples I have seen sinusoidal distributions of points (A), while my dataset seems to be distributed in a ‘rollercoaster’-like fashion (B).
II) How should I run loess?
I cannot work out the correct syntax for running loess with differentially weighted observations:
-1 loess( Count ~ Integrity, weight=None)
-2 loess( Count ~ 1:nrow(dataframe), weight=Integrity)
I performed several tests. Fig. C-D used loess (stats); Fig. E-F used weightedLowess (limma). I used two different packages since, from the loess docs, it appears that prior weights are set based on the x distance between points, whereas the weightedLowess function allows the user to supply prior weights for the regression.
Below is the basic syntax used to perform the regression and to generate the images.
C) loess(Count ~ Integrity, degree=2, span=0.1)
D) loess(Count ~ 1:nrow(df), weights=Integrity, degree=2, span=0.1)
E) weightedLowess(x=1:nrow(df), y=Count, weights=Integrity, span=0.1)
F) weightedLowess(x=1:nrow(df), y=order(Count), weights=Integrity, span=0.1)
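As a syntax reference only, here is a minimal sketch of a weighted loess fit with stats::loess. The data are simulated stand-ins (not the gene table above), and using Integrity both as predictor and as weight is an assumption made for illustration, not a recommendation. The point is that the weights argument takes a numeric vector with one value per row rather than a quoted column name.

```r
# Simulated stand-in data (NOT the real gene table)
set.seed(42)
n  <- 200
df <- data.frame(Integrity = runif(n, 0, 100))
df$Count <- 2 + 0.015 * df$Integrity + rnorm(n, sd = 0.3)

# weights is evaluated inside `data`, so the bare column name works;
# rows with higher Integrity get more influence on the local fits
fit <- loess(Count ~ Integrity, data = df,
             weights = Integrity, degree = 2, span = 0.5)

# Smoothed Count at a few Integrity values inside the observed range
new_pts <- data.frame(Integrity = c(20, 50, 80))
pred <- predict(fit, newdata = new_pts)
print(pred)
```

The span of 0.5 is arbitrary here; a very small span such as 0.1 needs enough points in every neighbourhood to fit a degree-2 polynomial.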
Please find enclosed images cited in the question.
Sample Images
I've trained a model to predict a certain variable. When I now use this model to predict said value and compare these predictions to the actual values, I get the two following distributions.
The corresponding R Data Frame looks as follows:
x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted
These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.
Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say e.g. 50% of the predictions are within -X% and +Y% of the actual values.
I've tried just plotting the difference predicted - actual, and also the difference compared to the mean of the respective group. However, neither approach has produced my desired result. With the plotted distribution, it is especially important to be able to make the above statement, i.e. 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?
Let's consider the two distributions as df_actual, df_predicted, then calculate
# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)
Then find the relative % difference by dividing the differences by the actual values:
x_diff <- mean(df_diff$x / df_actual$x) * 100
y_diff <- mean(df_diff$y / df_actual$y) * 100
This gives you the mean percentage deviation of the predictions, with its sign (+/-), in x as well as in y. This is my suggestion; also see this thread for displaying and measuring the area between two distribution curves.
I hope this helps.
ParthChaudhary is right - rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs, or otherwise the actual - predicted differences will be overshadowed by the variance of actual (and predicted) alone. I.e., if you have something like:
x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...
you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x-y.y.
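A minimal sketch of that pairing step, using only the three example rows above. The relative error is taken here as (predicted - actual) / actual, which is one possible convention, and the 25% and 75% quantiles of it bound the central 50% of the predictions.

```r
# Example rows from the table above
df <- data.frame(
  x    = c(0, 1, 2, 0, 1, 2),
  y    = c(10.9, 15.7, 25.3, 10, 17, 23),
  type = rep(c("actual", "predicted"), each = 3)
)

# Pair actual and predicted values by x; after the merge,
# y.x holds the actual values and y.y the predicted ones
paired <- merge(df[df$type == "actual", ],
                df[df$type == "predicted", ], by = "x")

# Signed relative error of each prediction, in percent
paired$rel_err <- (paired$y.y - paired$y.x) / paired$y.x * 100

# 50% of the predictions lie between these two percentages
bounds <- quantile(paired$rel_err, probs = c(0.25, 0.75))
print(bounds)

# Kernel density estimate of the relative errors; plot(dens) draws it
dens <- density(paired$rel_err)
```

With real data, bounds gives the -X% and +Y% asked for in the question.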
To better quantify whether the differences between your predicted and actual distributions are significant, you could consider using the Kolmogorov-Smirnov test in R, available via the function ks.test.
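A short sketch of the two-sample call, on simulated stand-in vectors (the means and standard deviations are made up for illustration). Note that this test compares the two marginal distributions and ignores which prediction belongs to which actual value.

```r
# Simulated stand-ins for the actual and predicted values
set.seed(1)
actual    <- rnorm(200, mean = 4.0, sd = 1.0)
predicted <- rnorm(200, mean = 4.2, sd = 1.1)

# Two-sample Kolmogorov-Smirnov test: are both samples drawn
# from the same continuous distribution?
res <- ks.test(actual, predicted)
print(res$statistic)  # maximum distance between the two empirical CDFs
print(res$p.value)
```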
I have a list that looks like this; it is a measure of dispersion for each sample.
1 2 3 4 5
0.11829384 0.24987017 0.08082147 0.13355495 0.12933790
To analyze this further I need it to be a distance structure; the vegan package needs it as a 'dist' object.
I found some solutions that apply to matrices > dist, but how could I change this current data into a dist object?
I am using the FD package; in the manual I found:
Still, one potential advantage of FDis over Rao’s Q is that in the unweighted case (i.e. with presence-absence data), it opens possibilities for formal statistical tests for differences in FD between two or more communities through a distance-based test for homogeneity of multivariate dispersions (Anderson 2006); see betadisper for more details.
I wanted to use the vegan betadisper function to test whether there are differences among regions (I provided these using the object "region", which has a column also named "region"):
functional <- FD(trait, comun)
mod <- betadisper(functional$FDis, region$region)
Using gowdis or fdisp from FD didn't work either.
distancias <- gowdis(rasgo)
mod <- betadisper(distancias, region$region)
dispersion <- fdisp(distancias, presence)
mod <- betadisper(dispersion, region$region)
I tried this but I need a list object. I thought I could pass those results to betadisper.
You cannot do this: FD::fdisp() does not return dissimilarities. It returns a list of three elements: the dispersions FDis for each sampling unit (SU), and the results of the eigen decomposition of the input dissimilarities (eig for eigenvalues, vectors for orthonormal eigenvectors). The FDis values are summarized for each original SU, but there is no information on the differences among SUs. The eigen decomposition can be used to reconstruct the original input dissimilarities (your distancias from FD::gowdis()), but you can use the input dissimilarities directly. Function FD::gowdis() returns a regular "dist" structure that you can use directly in vegan::betadisper() if that gives you a meaningful analysis. For this, your grouping variable must be based on the same units as your distancias. In a typical application of fdisp, the units are species (taxa), but it seems you want an analysis for communities/sites/whatever. This will not be possible with these tools.
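To make the distinction concrete, here is a small sketch. The trait matrix and the grouping are simulated placeholders, and base dist() stands in for FD::gowdis() (both return a "dist" object); the vector of dispersions is copied from the question. The betadisper call only runs if vegan is installed.

```r
# Simulated placeholder data: 10 units described by 4 variables
set.seed(1)
traits <- matrix(rnorm(40), nrow = 10,
                 dimnames = list(paste0("unit", 1:10), NULL))
groups <- factor(rep(c("north", "south"), each = 5))

# Pairwise dissimilarities among the 10 units: this is a "dist" object
d <- dist(traits)
print(inherits(d, "dist"))

# Per-sample dispersions as in the question: one number per sample,
# with no pairwise information, so not a "dist" object
fdis_like <- c(0.11829384, 0.24987017, 0.08082147, 0.13355495, 0.12933790)
print(inherits(fdis_like, "dist"))

# betadisper needs one group label per unit in the dist object
if (requireNamespace("vegan", quietly = TRUE)) {
  mod <- vegan::betadisper(d, groups)
  print(anova(mod))
}
```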
I would like to use R to perform hierarchical clustering with two groups of variables describing the same samples. One group is microarray gene expression data (for specific genes) that have been normalized and batch-effect corrected. The other group consists of quantitative clinical parameters that describe the same samples. However, these clinical variables have not been normalized or subjected to any kind of transformation (i.e. they are raw continuous values).
For example, one variable of these could have range of values from 2 to 35, whereas another from 0.1 to 0.9, etc.
Thus, as my ultimate goal is to implement hierarchical clustering and use both groups simultaneously (merged in a matrix/data frame), in order to inspect which of these clinical variables cluster with specific genes, etc.:
1) Is an initial transformation of the clinical variables necessary before merging them with the genes and performing the clustering? For example, a log2 transformation, which has also been applied to part of my gene expression data.
2) Or, a row scaling (that is the total features in the input data) would take into account this discrepancy ?
3) For a similar analysis/approach, like constructing a correlation plot of the above total variables, would a simple scaling be sufficient?
Without having seen your gene expression data, I can only provide you some general suggestions based on your description, in the context of the 3 questions you asked:
1) You should definitely check the distribution of each group. In R, you may use one or more of the following functions to visualize the distribution:
hist(expression_data) ##histogram
plot(density(expression_data)) ##density plot; alternative to histogram
qqnorm(expression_data); qqline(expression_data) #QQ plot
Since my understanding is that one of your expression data groups is log2 transformed, that particular group should be approximately normally distributed (i.e. a bell curve shape in the histogram and a straight line in the QQ plot). Whether to transform the group that has not yet been transformed will depend on what you want to do with the data. For instance, if you want to use a t-test to compare the two groups, then you definitely need a transformation, as there is a normality assumption associated with a t-test. With regard to hierarchical clustering, if you decide to use both groups in a single clustering analysis, then why would you ever keep one transformed and the other not?
2) Scaling by features is a reasonable approach. Here is a clustering lecture from a Utah State Univ. stats course, with an example. If you decide to use the heatmap function in R, its scale argument (scale="row" or scale="column") is an option for you.
3) I don't think there is a definitive answer to your third question. It depends on how many features you have available and what analyses you will be doing downstream. Similar to question 1, I would argue that simple scaling may be sufficient for visualizing your data by hierarchical clustering. However, do keep in mind that if you decide to fit a linear model (which is very common with microarray gene expression data), you might want to consider more sophisticated data scaling.
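As a rough sketch of points 2) and 3), with simulated data: the variable names are placeholders and the value ranges for the two clinical variables are the ones mentioned in the question. Each variable is centered and scaled with scale(), and the variables are then clustered on a correlation-based distance. Pearson correlation is unaffected by centering and scaling, so for a correlation plot the scaling step changes nothing, whereas it does matter for Euclidean distances.

```r
# Simulated data: 30 samples (rows), 2 genes and 2 clinical variables
set.seed(7)
n <- 30
dat <- data.frame(
  gene1 = rnorm(n, mean = 8,  sd = 1),    # log2-scale expression
  gene2 = rnorm(n, mean = 10, sd = 1.5),
  clin1 = runif(n, 2, 35),                # raw clinical value
  clin2 = runif(n, 0.1, 0.9)              # raw clinical value
)

# Center every variable to mean 0 and scale it to sd 1
scaled <- scale(dat)

# Correlations are identical before and after scaling
same_cor <- isTRUE(all.equal(cor(dat), cor(scaled)))
print(same_cor)

# Cluster the variables using 1 - correlation as the distance
var_dist <- as.dist(1 - cor(scaled))
hc <- hclust(var_dist, method = "average")
print(hc$labels[hc$order])

# Euclidean alternative, which does depend on the scaling
hc_euclid <- hclust(dist(t(scaled)), method = "average")
```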
I am wondering if there is a case where you see something in the principal components (PCs) that you do not see by looking univariately at the variables that the PCA is based on. For instance, considering the case of group differences: you see a separation of two groups in one of the PCs, but not in any single variable (univariate).
I will use an example in the two-dimensional setting to better illustrate my question: Let's suppose we have two groups, A and B, and for each observation we have two multivariate-normally distributed covariates.
# First setting:
library(MASS)  # provides mvrnorm
group_A <- mvrnorm(n=1000, mu=c(0,0), Sigma=matrix(c(10,3,3,2),2,2))
group_B <- mvrnorm(n=1000, mu=c(10,3), Sigma=matrix(c(10,3,3,2),2,2))
dat <- rbind(cbind.data.frame(group_A, group="A"),cbind.data.frame(group_B, group="B"))
# convert the group labels to integer colour codes (1 = A, 2 = B)
plot(dat[,1:2], xlab="x", ylab="y", col=as.integer(factor(dat[,"group"])))
In this first setting you see a group separation in the variable x, in the variable y, and you will also see a separation in both principal components. Hence, using the PCA we get the same result we got in the univariate case: the groups A and B have different values in the variables x and y.
In a second example generated by myself, you do not see a separation in variable x, variable y, or in PC1 or PC2. Hence, although our common sense suggests that we can distinguish between the two groups based on x and y, we do not observe this in the univariate case and the PCA doesn't help us either:
# Second setting
group_A <- mvrnorm(n=1000, mu=c(0,0), Sigma=matrix(c(10,3,3,2),2,2))
group_B <- mvrnorm(n=1000, mu=c(0,0), Sigma=matrix(c(10,-3,-3,2),2,2))
dat <- rbind(cbind.data.frame(group_A, group="A"),cbind.data.frame(group_B, group="B"))
plot(dat[,1:2], xlab="x", ylab="y", col=as.integer(factor(dat[,"group"])))
QUESTION: Is there a case where the PCA helps us in extracting correlations or separations we would not see in the univariate case? Can you construct one, or is this not possible in the two-dimensional case?
Thank you all in advance for helping me to disentangle this.
I think your question is mainly the result of a misunderstanding of what PCA does. It doesn't find clusters of data like, say, kmeans or DBSCAN. It projects n-dimensional data onto an orthogonal basis. Then it selects the top k dimensions (according to variance explained), where k < n.
So in your example, PCA doesn't know that group A was generated by some distribution and group B by another. It just sees the data in 2 dimensions and finds two principal components (from which you may or may not select 1). You might as well plot all 2000 data points in the same color.
However, if you wanted to use PCA in this instance, you would indicate that a 3rd dimension distinguishes between group A and group B. You could, for example, label group A +1 and group B -1 (or something that makes sense relative to the scale of the other dimensions). Then perform PCA on 3 dimensions, reducing to 2 or 1, depending on what the eigenvalues tell you about the variation explained.
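On the narrower question of whether such a case can be constructed in two dimensions at all, here is one possible construction, as a sketch rather than a general recipe. The covariance matrix and the mean shift are chosen by hand (they are not from the question): both groups share a strongly correlated covariance, and group B is shifted along the minor axis of that covariance. In expectation the groups then differ by roughly 0.5 within-group standard deviations in x and in y, but by roughly 3 along the second principal component. PCA still knows nothing about the groups; the separation shows up in PC2 only because the shift happens to line up with it.

```r
library(MASS)  # provides mvrnorm

set.seed(123)
# Hand-picked covariance: major axis along (1, 1), minor axis along (1, -1)
Sigma <- matrix(c(10, 9.5, 9.5, 10), 2, 2)
group_A <- mvrnorm(n = 1000, mu = c(0, 0),      Sigma = Sigma)
group_B <- mvrnorm(n = 1000, mu = c(1.5, -1.5), Sigma = Sigma)

xy    <- rbind(group_A, group_B)
group <- rep(c("A", "B"), each = 1000)

# PCA on the pooled data, without using the group labels
scores <- prcomp(xy)$x

# Absolute difference in group means, in units of within-group sd
sep <- function(v) {
  a <- v[group == "A"]
  b <- v[group == "B"]
  abs(mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)
}

seps <- c(x = sep(xy[, 1]), y = sep(xy[, 2]),
          PC1 = sep(scores[, 1]), PC2 = sep(scores[, 2]))
print(round(seps, 2))
```

Plotting hist(scores[, 2]) should show two modes, while histograms of x and y alone overlap heavily.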