Clustering and Heatmap on microarray data using R

I have a file with the results of a microarray expression experiment. The first column holds the gene names; the next 15 columns are samples: 7 from the post-mortem brains of people with Down's syndrome and 8 from people without it. The data are normalized. I would like to know which genes are differentially expressed between the two groups.
There are two groups and the data are nearly normally distributed, so a t-test was performed for each gene, and the p-values were added as a final column. Afterwards, I applied a correction for multiple testing.
I need to cluster the data to see if the differentially expressed genes (FDR<0.05) can discriminate between the groups.
Also, I would like to visualize the clustering using a heatmap with gene names on the rows and some meaningful names on the samples (columns)
I have written this code for the moment:
ds <- read.table("down_syndroms.txt", header = TRUE)
names(ds) <- c("Gene", paste0("Down", 1:7), paste0("Control", 1:8), "pvalues")
pvadj <- p.adjust(ds$pvalues, method = "BH")
# How many genes do we get with an FDR <= 0.05?
sum(pvadj <= 0.05)
[1] 5641
# Cluster the data (columns 2:17 are the 15 samples plus the p-value column)
ds_matrix <- as.matrix(ds[, 2:17])
ds_dist_matrix <- dist(ds_matrix)
my_clustering <- hclust(ds_dist_matrix)
# Heatmap
library(gplots)
hm <- heatmap.2(ds_matrix, trace = "none", margins = c(12, 12))
The heatmap I get doesn't look the way I would like, and I think the p-value column should be left out of it. Besides, R usually crashes when I try to plot the clustering, probably because of the size of the data: more than 22,000 genes.
How could I produce a better-looking tree (clustering) and heatmap?
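For what it's worth, one common way to get a cleaner result is to restrict the heatmap to the significant genes, drop the p-value column, and let heatmap.2 scale each row. A minimal, self-contained sketch; the synthetic numbers here stand in for your file, and with the real data `expr` would be built from the 15 sample columns and `pvadj` as above:

```r
library(gplots)

# Synthetic stand-in for the real data: 100 genes, 15 samples;
# the first 20 genes are made clearly significant on purpose.
set.seed(1)
expr <- matrix(rnorm(100 * 15), nrow = 100,
               dimnames = list(paste0("gene", 1:100),
                               c(paste0("Down", 1:7), paste0("Control", 1:8))))
pvalues <- c(rep(1e-6, 20), runif(80))
pvadj <- p.adjust(pvalues, method = "BH")

# Keep only the FDR <= 0.05 genes, with gene names as row names and
# sample names as column names; scale = "row" makes the colours reflect
# per-gene variation instead of absolute expression levels.
sig_expr <- expr[pvadj <= 0.05, ]
heatmap.2(sig_expr,
          scale   = "row",
          trace   = "none",
          labRow  = rownames(sig_expr),  # gene names on the rows
          labCol  = colnames(sig_expr),  # Down1..7 / Control1..8 on the columns
          margins = c(8, 10))
```

Because heatmap.2 clusters both rows and columns by default, the column dendrogram will also show whether the samples separate into the Down and Control groups. Working on the ~5,600 significant genes instead of all 22,000 should also help with the crashes.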

Related

How can I get the spatial correlation between two datasets in R?

I have two arrays:
data1=array(-10:30, c(2160,1080,12))
data2=array(-20:30, c(2160,1080,12))
#Add in some NAs
ind <- which(data1 %in% sample(data1, 1500))
data1[ind] <- NA
One is modelled global gridded data (lon,lat,month) and the other, global gridded observations (lon,lat,month).
I want to assess how 'skillful' the modelled data is at recreating the obs. I think the best way to do this is with a spatial correlation between the datasets. How can I do that?
I tried a straightforward x <- cor(data1, data2), but that just returned NA.
Then I thought I probably have to break it up by month or season. Looking at just one month, x <- cor(data1[,,1], data2[,,1]) returned a 1080 × 1080 matrix, most of which is NAs.
How can I get a spatial correlation between these two datasets? That is, I want to see where the modelled data performs well (high correlation with the observations) and where it does badly (low correlation).
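One way to turn this into a map rather than a single number is to correlate the 12 monthly values of each lon/lat cell across the two arrays. A minimal sketch with small synthetic arrays (4 × 3 × 12 here instead of 2160 × 1080 × 12, so it runs quickly):

```r
set.seed(1)
nlon <- 4; nlat <- 3; nmon <- 12
data1 <- array(rnorm(nlon * nlat * nmon), c(nlon, nlat, nmon))
data2 <- data1 + array(rnorm(nlon * nlat * nmon, sd = 0.5), c(nlon, nlat, nmon))
data1[sample(length(data1), 10)] <- NA   # sprinkle in some NAs, as in the real data

# Correlate the monthly time series of each lon/lat cell;
# pairwise.complete.obs drops the NA months for that cell only.
cor_map <- matrix(NA_real_, nlon, nlat)
for (i in seq_len(nlon)) {
  for (j in seq_len(nlat)) {
    cor_map[i, j] <- cor(data1[i, j, ], data2[i, j, ],
                         use = "pairwise.complete.obs")
  }
}
# High values mark cells where the model tracks the observations well
```

With the full-size arrays, the same double loop (or an apply over the first two margins) yields a 2160 × 1080 matrix that can be plotted with image() to see where the model does well or badly.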

HCPC in FactoMineR: How to count individuals in clusters?

The title says it all. I performed a multiple correspondence analysis (MCA) in FactoMineR with Factoshiny and ran an HCPC afterwards. I now have 3 clusters on my 2 dimensions. While the Factoshiny interface really helps visualize and navigate the analysis, I can't find a way to count the individuals in my clusters. Additionally, I would like to assign the cluster labels to the individuals in my dataset. Those operations are easy with hclust, but its algorithm doesn't work on categorical data.
# dummy dataset
x <- as.factor(c(1,1,2,1,3,4,3,2,1))
y <- as.factor(c(2,3,1,4,4,2,1,1,1))
z <- as.factor(c(1,2,1,1,3,4,2,1,1))
data <- data.frame(x,y,z)
# used packages
library(FactoMineR)
library(Factoshiny)
# the function used to open factoshiny in your browser
res.MCA <- Factoshiny(data)
# factoshiny code:
# res.MCA<-MCA(data,graph=FALSE)
# hcpc code in factoshiny
res.MCA <- MCA(data, ncp = 8, graph = FALSE)
res.HCPC <- HCPC(res.MCA, nb.clust = 3, consol = FALSE, graph = FALSE)
plot.HCPC(res.HCPC, choice = "tree", title = "Hierarchical tree")
plot.HCPC(res.HCPC, choice = "map", draw.tree = FALSE, title = "Factor map")
plot.HCPC(res.HCPC, choice = "3D.map", ind.names = FALSE, centers.plot = FALSE, angle = 60, title = "Hierarchical tree on the factor map")
I now want a variable data$cluster with 3 levels so that I can count the individuals in the clusters.
To anyone encountering a similar problem, this helped:
res.HCPC$data.clust # returns all values and cluster membership for every individual
res.HCPC$data.clust[1,]$clust # for the first individual
table(res.HCPC$data.clust$clust) # gives table of frequencies per cluster
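To additionally get the cluster labels as a column on the original data, the clust column of data.clust can simply be copied back, since HCPC returns the rows in their original order. Rerunning the dummy pipeline from above:

```r
library(FactoMineR)

# Rebuild the dummy dataset and the MCA/HCPC from the question
x <- as.factor(c(1, 1, 2, 1, 3, 4, 3, 2, 1))
y <- as.factor(c(2, 3, 1, 4, 4, 2, 1, 1, 1))
z <- as.factor(c(1, 2, 1, 1, 3, 4, 2, 1, 1))
data <- data.frame(x, y, z)
res.MCA  <- MCA(data, graph = FALSE)
res.HCPC <- HCPC(res.MCA, nb.clust = 3, consol = FALSE, graph = FALSE)

# data.clust is the original data plus a 'clust' factor column,
# so the labels can be copied straight back onto the data frame
data$cluster <- res.HCPC$data.clust$clust
table(data$cluster)   # individuals per cluster
```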

Can I use a subset of results from DESeq2 to calculate Bray-Curtis dissimilarity and then plot a PCoA?

I am having trouble using my results from DESeq2 when comparing the differential expression of bacterial genes between disease and control to then calculate the Bray-Curtis dissimilarity and subsequently plot a PCoA.
I have saved my output from DESeq2 as a data frame called siggenes1. It consists of 6000 rows, which are the gene names, and two columns: one for the p-value (all are < 0.05) and one for log2FoldChange (all > 1). Do I need to normalize my data before running the Bray-Curtis and PCoA? I thought this was already done by DESeq2, but looking at my code (which I can provide), I haven't included normalisation=T when carrying out the DESeq2 analysis.
Or would I need to normalise the initial data with the sweep function prior to using DESeq2?
My code for Bray-Curtis Dissimilarity
vegDistOut <- vegdist(t(siggenes1), "bray")
The above returns a single value, 0.995. Now I am a bit lost as to how to plot a PCoA from this, as my next bit of code fails.
pcoaOut <- pcoa(vegDistOut)
Error in array(STATS, dims[perm]) : 'dims' cannot be of length 0
I cannot proceed anymore because of the above steps.
If anybody could please help, I would be really grateful.
Thank you.
Welcome to Stack Overflow.
Bray-Curtis dissimilarity is usually used to determine how similar the species composition of two samples is. A typical input would consist of species counts per sample from software like Kraken or CLARK for high-throughput data, or, in the case of 16S, QIIME 2 or dada2:
| Genus | Sample 1 | Sample 2 |
|---------------|-------------|------------|
| Pseudomonas | 200 | 100 |
| Streptococcus | 50 | 20 |
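For intuition: the Bray-Curtis dissimilarity between the two samples in this table is just the sum of absolute count differences divided by the total count, which is also what vegdist(..., "bray") computes with samples as rows:

```r
# Counts from the table above
x <- c(Pseudomonas = 200, Streptococcus = 50)   # Sample 1
y <- c(Pseudomonas = 100, Streptococcus = 20)   # Sample 2

# Bray-Curtis: sum(|x - y|) / sum(x + y) = (100 + 30) / (300 + 70)
bc <- sum(abs(x - y)) / sum(x + y)
bc
# [1] 0.3513514
```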
You can of course calculate this metric for gene expression data, but that is not commonly done, and I would need more information on why you want to do that.
As far as I understand your description you are interested in visualizing distances between your samples' expression in a PCA plot. Using DESeq2, you could:
library(DESeq2)
# Get a DESeqDataSet from somewhere
dds <- DESeqDataSetFrom...(...)
# You don't need to run `DESeq()` on the dds for a PCA, just transform your data
# into a homoscedastic dataset with either VST or rlog
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
rld <- rlogTransformation(dds, blind=TRUE)
# 'xxx' here takes the place of your condition of interest from your
# design data frame
plotPCA(vsd, intgroup=c('xxx'))
All right, let's say you actually want to have the genes in your PCA rather than the samples. In that case you could take the transformed expression values from the VST or rlog object and run the PCA yourself:
library(DESeq2)
library(ggplot2)
# Get gene expression post VST
vst_expr <- assay(vsd)
# Or - if you want to select some genes
vst_expr <- assay(vsd)[c(...), ]
# Perform PCA
pca <- prcomp(vst_expr)
# Calculate explained % variation
pvar_expl <- round(((pca$sdev ^ 2) / sum(pca$sdev ^ 2)) * 100, 2)
ggplot(as.data.frame(pca$x), aes(x = PC1, y = PC2)) +
geom_point() +
xlab(paste("PC1: ", pvar_expl[1], "%")) +
ylab(paste("PC2: ", pvar_expl[2], "%"))
As a final point, it is generally not advisable to select only a subset of genes before performing exploratory data analysis, especially in the way you describe. You have already tested these genes for differential expression in DESeq2, so you know they are different. It is much better to perform a blind visualization using PCAs or heatmaps. Follow this to learn all you need about DESeq2, and also check out https://support.bioconductor.org/

Plotting different mixture model clusters in the same curve

I have two sets of data: one representing a healthy data set with 4 variables and 11,000 points, and another representing a faulty set with 4 variables and 600 points. I have used R's mclust package to obtain a GMM clustering for each data set separately. What I want to do is obtain both clusterings in the same frame so I can study them at the same time. How can that be done?
I have tried joining both the datasets but the result I am obtaining is not what I want.
The code in use is:
Dat4M <- Mclust(Dat3, G = 3)
Dat3 is where I store my dataset, and Dat4M is where I store the result of Mclust. G = 3 is the number of Gaussian mixture components I want, in this case three. To plot the result, the following code is used:
plot(Dat4M)
The following is obtained when I apply the above code to my healthy dataset: [healthy-data plot]
The following is obtained when the above code is used on the faulty dataset: [faulty-data plot]
Notice that in the faulty data's density plot, for the mixture of CCD and CCA, there are two density peaks. Now, I want to place the healthy data in the same frame and study the differences.
Any help on how to do this will be appreciated.
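As a sketch of one way to get both clusterings into one frame: fit each data set separately and overlay the classifications on the same pair of variables, using different point shapes for the two data sets. The data below are synthetic stand-ins, and the variable names CCD and CCA are taken from the description above, not from your actual file:

```r
library(mclust)

# Synthetic stand-ins for the healthy and faulty data sets
set.seed(1)
healthy <- as.data.frame(matrix(rnorm(200 * 4), ncol = 4))
faulty  <- as.data.frame(matrix(rnorm(60 * 4, mean = 2), ncol = 4))
names(healthy) <- names(faulty) <- c("CCD", "CCA", "V3", "V4")

# Fit a 3-component GMM to each data set separately
fit_h <- Mclust(healthy, G = 3)
fit_f <- Mclust(faulty,  G = 3)

# Overlay both classifications on the same pair of variables:
# circles = healthy, triangles = faulty, colour = cluster
plot(healthy$CCD, healthy$CCA, col = fit_h$classification, pch = 1,
     xlim = range(healthy$CCD, faulty$CCD),
     ylim = range(healthy$CCA, faulty$CCA),
     xlab = "CCD", ylab = "CCA")
points(faulty$CCD, faulty$CCA, col = fit_f$classification, pch = 17)
legend("topleft", legend = c("healthy", "faulty"), pch = c(1, 17))
```

The same overlay can be repeated for any other pair of the four variables, which lets you compare the two density regions you describe directly in one frame.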

How to specify the subset/sample number for permutations using specaccum() in R's vegan package

I have a community matrix (species as columns, samples as rows) from which I would like to generate a species accumulation curve (SAC) using the specaccum() and fitspecaccum() functions in R's vegan package. For the resulting SAC and cumulative species richness at sample X to be comparable among regions (I have one community matrix per region), I need specaccum() to choose the same number of sets within each region. My problem is that some regions have more sets than others. I would like to limit the sample size to the minimum number of sets among regions (45 in my case), so I would like specaccum() to randomly sample 45 sets, 100 times (permutations=100), for each region, drawing from the entire data set available for that region. The code below has not worked; it doesn't recognize subset=45. The vegan documentation says subset needs to be logical; I don't understand how a subset number can be logical, but maybe I am misinterpreting what subset is. Is there another way to do this? Would it be sufficient to run specaccum() on the entire number of sets available for each region and then just truncate the output to 45?
require(vegan)
pool1 <- specaccum(comm.matrix, gamma = "jack1", method = "random", subset = 45, permutations = 100)
Any help is much appreciated.
Why do you want to limit the function to work in a random sample of 45 cases? Just use the species accumulation up to 45 cases. Taking a random subset of 45 cases gives you the same accumulation, except for the random error of subsampling and throwing away information. If you want to compare your different cases, just compare them at the sample size that suits all cases, that is, at 45 or less. That is the idea of species accumulation models.
The subset argument is intended for situations where you have a (possibly) heterogeneous collection of sampling units and you want to stratify the data. For instance, if you want to see only the species accumulation in the "OldLow" habitat type of the Barro Colorado data, you could do:
data(BCI, BCI.env)
plot(specaccum(BCI, subset = BCI.env$Habitat == "OldLow"))
If you want to have, say, a subset of 30 sample plots of the same data, you could do:
take <- c(rep(TRUE, 30), rep(FALSE, 20))
plot(specaccum(BCI)) # to see it all
# repeat the following to see how taking subset influences
lines(specaccum(BCI, subset = sample(take)), col = "blue")
If you repeat the last line, you see how taking random subset influences the results: the lines are normally within the error bars of all data, but differ from each other due to random error.
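Following that advice, the accumulation curve can simply be read off at 45 sampling units to compare regions at a common sample size. Shown here with the BCI example data; your comm.matrix would take its place:

```r
library(vegan)
data(BCI)

# Random species accumulation over 100 permutations
sac <- specaccum(BCI, method = "random", permutations = 100)

# Expected richness (and its spread) after exactly 45 sampling units
sac$richness[45]   # mean accumulated richness at 45 plots
sac$sd[45]         # standard deviation across the 100 permutations
```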
