Plotting different mixture model clusters in the same frame - R

I have two sets of data: one representing a healthy data set with 4 variables and 11,000 points, and another representing a faulty set with 4 variables and 600 points. I have used R's mclust package to fit a GMM clustering to each data set separately. What I want to do is display both sets of clusters in the same frame so I can study them at the same time. How can that be done?
I have tried joining the two datasets, but the result is not what I want.
The code in use is:
Dat4M <- Mclust(Dat3, G = 3)
Dat3 holds my dataset and Dat4M stores the result of Mclust. G = 3 requests three Gaussian mixture components. To plot the result, the following code is used:
plot(Dat4M)
The following is obtained when I apply the above code to my healthy dataset:
The following is obtained when the above code is used on the faulty dataset:
Notice that in the faulty data density plot, in the panel for the CCD/CCA pair of variables, two density peaks have been obtained. I want to place these in the same panel as the healthy data and study the differences.
Any help on how to do this will be appreciated.
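One way to get both fitted densities into a single frame is to evaluate each mixture on a common grid and overlay the contour plots. The sketch below uses simulated stand-ins for the two data sets (`healthy` and `faulty` are hypothetical names, and contours are drawn for one pair of variables at a time, since the real data have 4 variables):

```r
library(mclust)

# Simulated stand-ins for the two data sets (one pair of variables)
set.seed(1)
healthy <- cbind(rnorm(1000), rnorm(1000))
faulty  <- cbind(rnorm(200, 2), rnorm(200, 2))

fitH <- Mclust(healthy, G = 3)
fitF <- Mclust(faulty,  G = 3)

# Evaluate each fitted mixture density on a common grid
xs <- seq(min(healthy[, 1], faulty[, 1]), max(healthy[, 1], faulty[, 1]), length.out = 80)
ys <- seq(min(healthy[, 2], faulty[, 2]), max(healthy[, 2], faulty[, 2]), length.out = 80)
grid <- as.matrix(expand.grid(xs, ys))

dH <- matrix(dens(data = grid, modelName = fitH$modelName,
                  parameters = fitH$parameters), length(xs))
dF <- matrix(dens(data = grid, modelName = fitF$modelName,
                  parameters = fitF$parameters), length(xs))

# Overlay both density contours in one frame
contour(xs, ys, dH, col = "blue", xlab = "Var 1", ylab = "Var 2")
contour(xs, ys, dF, col = "red", add = TRUE)
legend("topleft", c("healthy", "faulty"), col = c("blue", "red"), lty = 1)
```

For the full 4-variable data you would repeat this per variable pair, or loop over pairs in a panel layout.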

Related

Loess regression on genomic data

I am struggling with R's loess function.
I have a dataframe on which I would like to perform locally weighted polynomial regression.
Each ‘Gene’ has an associated ‘Count’ (log10-transformed), which gives information about the gene's expression, and an ‘Integrity’ measurement (range 0-100), which tells you the quality of the ‘Count’ measurement for that gene. As a general principle, the higher the ‘Integrity’, the more reliable the ‘Count’ for the specific Gene.
Below is a sample chunk of the dataframe:

Gene               Integrity  Count
ENSG00000198786.2  96.6937    3.55279
ENSG00000210194.1  96.68682   1.39794
ENSG00000212907.2  94.81709   2.396199
ENSG00000198886.2  93.87207   3.61595
ENSG00000198727.2  89.08319   3.238548
ENSG00000198804.2  88.82048   3.78326
I would like to use loess to predict the ‘true’ value of genes with low ‘Integrity’ values (since these are less reliable).
I) Should I pre-process my dataframe in order to correctly apply loess? From a plethora of examples I observed sinusoidal distributions of points (A), while my dataset seems distributed in a ‘rollercoaster’-like fashion (B).
II) How should I run loess?
I cannot work out the correct syntax to differentially weight the observations:
-1 loess(Count ~ Integrity)                                # no prior weights
-2 loess(Count ~ seq_len(nrow(df)), weights = Integrity)   # Integrity as weights
I performed several tests. Figs. C-D use loess (stats); Figs. E-F use weightedLowess (limma). I used two different packages because, from the loess docs, its local weights are based on the x-distance between points, whereas weightedLowess allows the user to supply prior weights for the regression.
Below is the basic syntax adopted to perform the regressions and generate the images:
C) loess(Count ~ Integrity, degree = 2, span = 0.1)
D) loess(Count ~ seq_len(nrow(df)), weights = df$Integrity, degree = 2, span = 0.1)
E) weightedLowess(x = 1:nrow(df), y = df$Count, weights = df$Integrity, span = 0.1)
F) weightedLowess(x = 1:nrow(df), y = order(df$Count), weights = df$Integrity, span = 0.1)
Please find enclosed the sample images (A-F) cited in the question.
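For reference, stats::loess does accept per-observation prior weights through its weights argument (these multiply the local tricube distance weights). A minimal sketch, using made-up data with the Count and Integrity column names from the question:

```r
# Made-up stand-in data frame with the column names from the question
set.seed(42)
df <- data.frame(Integrity = runif(200, 0, 100))
df$Count <- 3 + 0.01 * df$Integrity + rnorm(200, sd = 0.3)

# 'weights' supplies prior case weights; Integrity is used directly here,
# so high-Integrity genes pull the local fits harder
fit <- loess(Count ~ Integrity, data = df, weights = df$Integrity,
             degree = 2, span = 0.5)

# Smoothed ("predicted") Count for every gene
df$smoothed <- predict(fit, newdata = df)
```

This keeps Integrity on the x-axis (option -1 above) while still using it as a prior weight; whether that is the right model for the data is a separate statistical question.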

Divide a set of curves into groups using functional data analysis

I have a dataset containing about 500 curves. Every row is a curve (which comes from some experimental measurements) and the columns are the measurement intervals (I don't think it's important, but the intervals are frequency measurements, not time measurements).
Here you can find the data:
https://drive.google.com/file/d/1q1F1any8RlCIrn-CcQEzLWyrsyTBCCNv/view?usp=sharing
curves     t1       t2
1          -57.48   -57.56
2          -56.22   -56.28
3          -57.06   -57.12
I want to divide this dataset into 2 - 4 homogeneous groups of curves.
I've seen that there are some packages in R (fda and funHDDC) that find such clusters, but I don't know how to create the object needed to start the analysis, and I also don't understand why the initial dataset isn't accepted as-is. How can I transform the data I have into a form suitable for processing with the above packages?
What results should I expect?
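As a sketch of the data preparation (not tested against the linked file; the curve matrix below is simulated, and fda/funHDDC must be installed): fda expects curves in the columns of the y matrix, so a row-per-curve dataset needs transposing before smoothing onto a basis. The resulting fd object can then be passed to funHDDC directly:

```r
library(fda)
library(funHDDC)

# Simulated stand-in: 50 curves observed at 100 frequency points
set.seed(1)
curves <- matrix(rnorm(50 * 100), nrow = 50)

freqs <- seq_len(ncol(curves))
basis <- create.bspline.basis(rangeval = range(freqs), nbasis = 20)

# fda wants one curve per column, so transpose the row-per-curve matrix
fdobj <- smooth.basis(argvals = freqs, y = t(curves), fdParobj = basis)$fd

# Try 2 to 4 groups; res$class gives the cluster of each curve
res <- funHDDC(fdobj, K = 2:4)
table(res$class)
```

The number of basis functions (20 here) is a tuning choice; funHDDC selects among the requested K values by its own criterion.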

How to specify subset/ sample number for permutations using specaccum() in R's vegan package

I have a community matrix (species as columns, samples as rows) from which I would like to generate a species accumulation curve (SAC) using the specaccum() and fitspecaccum() functions in R's vegan package. For the resulting SAC and cumulative species richness at sample X to be comparable among regions (I have one community matrix per region), I need specaccum() to choose the same number of sets within each region. My problem is that some regions have a larger number of sets than others.
I would like to limit the sample size to the minimum number of sets among regions (in my case 45), i.e. have specaccum() randomly sample 45 sets, 100 times (permutations = 100), for each region, drawing from the entire data set available for that region. The code below has not worked: it doesn't recognize subset = 45. The vegan package info says "subset" needs to be logical; I don't understand how a subset number can be logical, but maybe I am misinterpreting what subset is. Is there another way to do this? Would it be sufficient to run specaccum() on the entire number of sets available for each region and then just truncate the output at 45?
require(vegan)
pool1 <- specaccum(comm.matrix, gamma = "jack1", method = "random", subset = 45, permutations = 100)
Any help is much appreciated.
Why do you want to limit the function to a random sample of 45 cases? Just use the species accumulation up to 45 cases. Taking a random subset of 45 cases gives you the same accumulation curve, except that subsampling adds random error and throws away information. If you want to compare your different cases, compare them at a sample size that suits all of them, that is, at 45 or fewer. That is the idea of species accumulation models.
The subset is intended for situations where you have (possibly) heterogeneous collection of sampling units, and you want to stratify data. For instance, if you want to see only the species accumulation in the "OldLow" habitat type of the Barro Colorado data, you could do:
data(BCI, BCI.env)
plot(specaccum(BCI, subset = BCI.env$Habitat == "OldLow"))
If you want to have, say, a subset of 30 sample plots of the same data, you could do:
take <- c(rep(TRUE, 30), rep(FALSE, 20))
plot(specaccum(BCI)) # to see it all
# repeat the following to see how taking subset influences
lines(specaccum(BCI, subset = sample(take)), col = "blue")
If you repeat the last line, you see how taking random subset influences the results: the lines are normally within the error bars of all data, but differ from each other due to random error.
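Following that advice, truncating at a common sample size is just a matter of indexing the specaccum result (BCI stands in here for one region's matrix):

```r
library(vegan)
data(BCI)

# Accumulate over all available sites, then read off richness at site 45
sac <- specaccum(BCI, method = "random", permutations = 100)

richness_at_45 <- sac$richness[45]  # expected cumulative richness at 45 sites
sd_at_45       <- sac$sd[45]        # and its standard deviation
```

Repeating this per region and comparing `richness_at_45` values compares all regions at the same sampling effort without discarding data.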

How can I plot an expected richness curve (Chao1) via vegan in R

I have a dataset from one site containing data on species and its abundance (number of individuals for each species in sample).
I use vegan package for alpha diversity analysis.
For instance, I plot a species rarefaction curve via the rarecurve function (I can't use specaccum because I have data from only one site), and calculate a Chao1 index via the estimateR function.
How can I plot a Chao1 expected richness curve using the estimateR function? I would then like to combine these curves in one plot.
library(vegan)
TR <- matrix(nrow=1,c(3,1,1,17,1,1,1,1,1,2,1,1,3,13,31,24,6,1,1,4,1,10,2,3,1,5,6,1,1,1,4,16,17,15,6,9,66,3,1,3,24,15,2,3,17,1,7,2,27,13,2,1,1,3,1,3,30,7,1,1,4,1,2,5,1,1,6,2,1,9,11,5,8,7,2,2,2,1,13,3,8,4,1,5,27,1,62,13,6,7,7,4,9,1,7,7,1,25,1,5,3,1,2,1,1,5,2,73,25,17,43,88,2,3,38,4,5,6,6,16,2,13,10,7,1,2,9,3,1,3,1,8,4,4,5,13,2,25,9,2,1,12,29,4,1,9,1,1,3,4,2,9,4,26,2,7,4,18,1,10,10,4,6,5,20,1,2,11,1,3,1,2,1,1,12,3,2,1,4,24,7,22,19,43,2,9,18,1,1,1,9,7,6,1,8,2,2,19,7,26,4,4,1,3,4,5,2,4,8,2,3,1,5,5,1,11,6,6,2,4,3,1,10,6,9,16,1,1,32,1,1,31,2,12,2,13,1,2,9,13,1,11,8,1,14,5,9,1,3,1,7,1,1,13,17,1,1,3,2,9,1,4,1,7,2,2,9,24,20,2,1,2,2,1,9,5,1,1,23,13,7,1,8,5,47,32,6,13,16,8,2,1,5,4,3,1,2,1,1,1,3,14,6,21,2,7,2,2,16,2,10,21,18,2,1,3,33,12,55,4,1,5,14,3,10,2,4,1,2,5,7,6,2,12,14,28,18,30,28,7,1,1,1,3,4,2,17,60,31,3,3,2,2,3,6,2,6,1,13,2,3,13,7,2,10,19,9,7,1,3))
num_species=specnumber(TR)
chao1=estimateR(TR)[2,]
shannon=diversity(TR,"shannon")
rarecurve(TR)
estimateR(TR)
Here is a plot built with SigmaPlot from EstimateS output (I fed it the same data):
The thin line is the expected richness (Chao1). In R I can plot only the SAC.
In EstimateS I get estimates for all 2990 individuals, but not in R.
I don't know how things are done in EstimateS, but it looks like the expected richness (Chao1) curve is based on the mean of random subsamples of the community. This could be done like this:
subchao <- sapply(1:2990, function(i)
mean(sapply(1:100, function(...) estimateR(rrarefy(TR, i))[2,])))
This would randomly rarefy (rrarefy()) to all sample sizes from 1 to 2990 and find the mean from 100 replicates of each. This will take time.
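To combine the rarefaction curve and the Chao1 curve in one plot, the same idea can be wrapped up as below (a small simulated community stands in for TR, and the subsample grid is coarsened so it runs quickly):

```r
library(vegan)

set.seed(1)
TR <- matrix(rpois(100, 5), nrow = 1)  # stand-in for the real abundance data
n  <- sum(TR)

# Rarefied richness and mean Chao1 at a grid of subsample sizes
sizes <- seq(1, n, by = 25)
raref <- sapply(sizes, function(i) rarefy(TR, i))
chao  <- sapply(sizes, function(i)
  mean(replicate(20, estimateR(rrarefy(TR, i))[2, ])))

# Overlay both curves in one frame
plot(sizes, chao, type = "l", lty = 2, xlab = "Individuals", ylab = "Species",
     ylim = range(c(raref, chao), finite = TRUE))
lines(sizes, raref, lwd = 2)
legend("bottomright", c("Rarefaction", "Chao1"), lty = c(1, 2), lwd = c(2, 1))
```

For the real data, increase the grid resolution and the number of replicates (as in the answer above), at the cost of run time.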

Clustering and Heatmap on microarray data using R

I have a file with the results of a microarray expression experiment. The first column holds the gene names. The next 15 columns are 7 samples from the post-mortem brain of people with Down's syndrome, and 8 from people not having Down's syndrome. The data are normalized. I would like to know which genes are differentially expressed between the groups.
There are two groups and the data is nearly normally distributed, so a t-test has been performed for each gene. The p-values were added in another column at the end. Afterwards, I did a correction for multiple testing.
I need to cluster the data to see if the differentially expressed genes (FDR<0.05) can discriminate between the groups.
Also, I would like to visualize the clustering using a heatmap with gene names on the rows and some meaningful names on the samples (columns)
I have written this code for the moment:
ds <- read.table("down_syndroms.txt", header=T)
names(ds) <- c("Gene",paste0("Down",1:7),paste0("Control",1:8), "pvalues")
pvadj <- p.adjust(ds$pvalues, method = "BH")
# How many genes do we get with FDR <= 0.05?
sum(pvadj<=0.05)
[1] 5641
# Cluster the data
ds_matrix <- as.matrix(ds[, 2:17])
ds_dist_matrix<-dist(ds_matrix)
my_clustering<-hclust(ds_dist_matrix)
# Heatmap
library(gplots)
hm <- heatmap.2(ds_matrix, trace='none', margins=c(12,12))
The heatmap I have produced doesn't look the way I would like, and I think I should remove the p-values from it. Besides, R usually crashes when I try to plot the clustering (probably because of the size of the data file, with more than 22 thousand genes).
How could I do a better looking tree (clustering) and heatmap?
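A sketch of one common fix: subset to the FDR-significant genes, drop the p-value column, and let heatmap.2 z-score each row. Restricting to significant genes also avoids clustering all 22,000 rows, which is what makes the plot crash. Simulated stand-ins for ds and pvadj are used here so the chunk is self-contained:

```r
library(gplots)

# Simulated stand-in for 'ds' (Gene + 15 samples) and adjusted p-values
set.seed(1)
ds <- data.frame(Gene = paste0("gene", 1:1000),
                 matrix(rnorm(1000 * 15), ncol = 15))
names(ds) <- c("Gene", paste0("Down", 1:7), paste0("Control", 1:8))
pvadj <- runif(1000)

sig <- pvadj <= 0.05
m   <- as.matrix(ds[sig, 2:16])   # expression columns only, no p-values
rownames(m) <- ds$Gene[sig]

heatmap.2(m, trace = "none", scale = "row",   # z-score each gene across samples
          col = bluered(50), margins = c(10, 8),
          ColSideColors = rep(c("grey20", "grey70"), c(7, 8)))  # group bar
```

scale = "row" is what usually makes expression heatmaps readable, since absolute levels vary far more between genes than between samples; the ColSideColors bar marks the Down vs. Control columns.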
