Is there any way of accessing the shifted matrix of a ComplexHeatmap? - r

I would like to draw a very large heatmap with the great ComplexHeatmap library.
The initial matrix has more than 2000 rows and 50+ columns, with integer values ranging from -3 to +3.
I immediately ran into the Error: C stack usage is too close to the limit issue - this might be a limitation of the underlying recursive (?) algorithm.
After some struggling with prlimit and ulimit stack adjustments, I found the jitter parameter as a solution.
So everything is almost OK now, except that the jitter parameter randomizes the row clustering on each execution, which makes it hard to check the consistency of the resulting heatmaps in my pipeline.
I realized that I can access the input matrix.
For example:
hm <- Heatmap(input_matrix,
              name = "Monstre Heatmap etc.",
              # ... long parameter list ...
              show_row_dend = TRUE,
              jitter = TRUE,
              # ... further parameters
)
> # accessing the inner data and comparing it with the input:
> identical(input_matrix, hm@matrix)
[1] TRUE
Is there any field in the Heatmap object that exposes the shifted (jittered) matrix?

Per request, this is reiterating my comments above:
The jitter introduced by the Heatmap function when you set jitter = TRUE can be kept reproducible by setting a fixed random seed prior to running the function, e.g. set.seed(123).
If jitter = TRUE, random values from a uniform distribution between 0 and 1e-10 are generated, as per the Heatmap documentation, so you could probably just introduce the jitter into the matrix yourself (using a defined random seed) prior to running Heatmap and get the same result, such as:
input_matrix <- input_matrix + runif(length(input_matrix), 0, 1e-10)
as you mentioned yourself already.
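A minimal sketch of that pre-jittering idea (assuming the input_matrix from the question; the seed value is arbitrary):
library(ComplexHeatmap)
set.seed(123)  # fixed seed -> the same jitter, and hence the same row clustering, on every run
input_jittered <- input_matrix + runif(length(input_matrix), 0, 1e-10)
hm <- Heatmap(input_jittered, jitter = FALSE)  # jitter already applied manually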
Regarding the memory issue, you may try to install and use fastcluster, a faster drop-in replacement for hclust that may consume less memory.
For ComplexHeatmap to use it, it may be necessary to run ht_opt("fast_hclust" = TRUE) prior to running Heatmap. To reset to the defaults, use ht_opt(RESET = TRUE).
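Put together, a sketch of that sequence:
# install.packages("fastcluster")  # drop-in replacement for stats::hclust
library(ComplexHeatmap)
ht_opt("fast_hclust" = TRUE)  # subsequent Heatmap() calls cluster via fastcluster
hm <- Heatmap(input_matrix)
ht_opt(RESET = TRUE)          # restore the defaults afterwards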
Concerning the legend, you can configure the limits of the color legend yourself (see: https://jokergoo.github.io/ComplexHeatmap-reference/book/legends.html).
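One common way to fix the legend limits is to pass a color mapping function; a sketch using circlize::colorRamp2 with the -3..+3 value range from the question (the colors are arbitrary):
library(circlize)
col_fun <- colorRamp2(c(-3, 0, 3), c("blue", "white", "red"))  # fixed limits -3..+3
Heatmap(input_matrix, col = col_fun)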
As a side note, I found no issue when generating a 20000 x 50 matrix with 50% identical rows or columns with jitter = FALSE, so I am not sure your stack issues are directly caused by this.

Related

DBSCAN Clustering returning single cluster with noise points

I am trying to perform DBSCAN clustering on the data https://www.kaggle.com/arjunbhasin2013/ccdata. I have cleaned the data and applied the algorithm.
data1 <- read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')
head(data1)
data1 <- data1[, 2:18]
dim(data1)
colnames(data1)
head(data1, 2)

# to check if data has empty cols or rows
library(purrr)
is_empty(data1)

# to check if data has duplicates
library(dplyr)
any(duplicated(data1))

# to check if data has NA values
any(is.na(data1))
data1 <- na.omit(data1)
any(is.na(data1))
dim(data1)
The algorithm was applied as follows.
# DBSCAN
data1 <- scale(data1)
library(fpc)
library(dbscan)
set.seed(500)

# to find the optimal eps
kNNdistplot(data1, k = 34)
abline(h = 4, lty = 3)
The figure shows the 'knee' used to identify the 'eps' value. Since there are 17 attributes to be considered for clustering, I have taken k = 17*2 = 34.
db <- dbscan(data1, eps = 4, minPts = 34)
db
The result I obtained is "The clustering contains 1 cluster(s) and 147 noise points."
No matter what values I try for eps and minPts, the result is the same.
Can anyone tell me where I have gone wrong?
Thanks in advance.
You have two options:
Increase the radius of your center points (given by the epsilon parameter).
Decrease the minimum number of points (minPts) required to define a center point.
I would start by decreasing the minPts parameter, since I think it is very high: if DBSCAN does not find that many points within the radius, it will not add more points to a group (a quick parameter sweep is sketched below).
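A quick way to test this advice (a sketch using the dbscan package loaded in the question; the minPts grid is arbitrary):
# count clusters and noise points for a few minPts values at a fixed eps
for (mp in c(5, 10, 17, 34)) {
  db <- dbscan::dbscan(data1, eps = 4, minPts = mp)
  cat("minPts =", mp,
      "-> clusters:", length(unique(db$cluster[db$cluster != 0])),
      ", noise points:", sum(db$cluster == 0), "\n")
}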
A typical problem with using DBSCAN (and clustering in general) is that real data typically does not fall into nice clusters, but forms one connected point cloud. In this case, DBSCAN will always find only a single cluster. You can check this with several methods. The most direct method is a pairs plot (a scatterplot matrix):
plot(as.data.frame(data1))
Since you have many variables, the scatterplot panels are very small, but you can see that the points are very close together in almost all panels. DBSCAN will connect all points in these dense areas into a single cluster. k-means will just partition the dense area.
Another option is to check for clusterability with methods like VAT or iVAT (https://link.springer.com/chapter/10.1007/978-3-642-13657-3_5).
library("seriation")
## calculate distances for a small sample
d <- dist(data1[sample(seq(nrow(data1)), size = 1000), ])
iVAT(d)
You will see that the plot shows no block structure around the diagonal, indicating that clustering will not find much.
To improve clustering, you need to work on the data. You can remove irrelevant variables, you may have very skewed variables that should be transformed first. You could also try non-linear embedding before clustering.
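For instance, a sketch of the transformation idea (applied before scale(); PURCHASES is a column of the CC GENERAL data, but which columns to transform is an assumption you should check against histograms):
# log1p() compresses the long right tail of a skewed, non-negative variable
raw <- na.omit(read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')[, 2:18])
raw$PURCHASES <- log1p(raw$PURCHASES)
data1 <- scale(raw)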

R - DBSCAN fviz_cluster - get coordinates of elements with dim1 and dim2

I'm a noob with R, and I'm trying to do clustering on some data samples.
I've tried a PCA,
library(FactoMineR)  # PCA() comes from FactoMineR
res.pca <- PCA(df,
               ncp = 5,  # number of principal components
               graph = TRUE)
and I can get the full elements list with new coordinates using
res.pca$ind
This is great and works perfectly
For info: using the first two axes of the PCA, I get 80% of the variability on one axis and a bit more than 10% on the second axis. I was quite proud of the result considering that I have 30 variables... and in the end the PCA implicitly says that 2 dimensions will be enough.
Still working on those data, I tried the DBSCAN clustering method fpc::dbscan:
library(factoextra)
db <- fpc::dbscan(df, eps = 22, MinPts = 3)
and after running dbscan and graphing the clusters using fviz_cluster, the two-dimension display says: 92.8% on axis 1 and 6.7% on axis 2 - more than 99% of the total variance explained with 2 axes!
In short, DBSCAN has transformed my 30-variable data in a way that looks to be better than the PCA. The overall clustering of DBSCAN is rubbish for my data, but the transformation that has been used is absolutely excellent.
My issue is that I would like to get access to those new coordinates... but I have found no way so far.
The only accessible fields I can see are:
db$cluster, db$eps, db$MinPts, db$isseed
BUT I suspect that more data is accessible, otherwise how could fviz_cluster present it?
Any idea?
The projection is not done by dbscan. fviz_cluster uses the first two components obtained via stats::prcomp on the data.
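You can therefore recompute the same coordinates yourself; a sketch (assuming fviz_cluster's defaults, which standardize the data with stand = TRUE before the PCA; df and db as in the question):
# redo the projection fviz_cluster displays: the first two PCs of the (standardized) data
pc <- prcomp(df, center = TRUE, scale. = TRUE)
coords <- as.data.frame(pc$x[, 1:2])  # the Dim1 / Dim2 coordinates
coords$cluster <- factor(db$cluster)
head(coords)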

How to only plot edges above some minimum threshold value using markovchain in r

I have a large transition matrix that I want to plot a graph of in r. I have chosen the markovchain package to do this, which allows me to turn this matrix into a markovchain object and then plot it as follows:
library(markovchain)
tMat = matrix(c(0, .2, .8,
                .3, .4, .3,
                .1, .4, .5),
              nrow = 3, byrow = TRUE)  # rows must each sum to 1
mc = new("markovchain", transitionMatrix = tMat)
plot(mc)
which produces the following output:
Of course, this is just an example, and as I mentioned before, the real transition matrix is much hairier.
My question is: how can I plot only edges that have values greater than some minimum threshold? If I try to "zero out" all values below a certain threshold, markovchain complains that the rows do not sum to one (because it is then no longer a stochastic matrix). But for a very complicated graph, it is less important that the edges connected to a vertex sum to 1, and more important that the graph remains readable. Is there any way to do this?
I know that the plot function is built on top of igraph.plot, so I am hoping that there is some option there that might help?
Any suggestions would be much appreciated!
-Paul
Whoops: I answered my own question. Just wanted to leave this here in case other people encounter the same problem: you can simply create the markovchain object and then go into its transitionMatrix attribute and edit the values directly:
mc@transitionMatrix[mc@transitionMatrix < .2] = 0
which produces:
Now a good follow-up question, which actually gets at the original problem and would be a better solution, is: how to only suppress the numbers in the graph output rather than deleting the lines altogether? It leads to ugly situations where previously connected nodes/vertices become islands. I think this would involve going into the part of the igraph.plot object that stores these values, which I don't know how to do even after researching quite a bit.
how to only suppress the numbers in the graph output rather than deleting the lines altogether?
Coerce the markovchain object to an igraph object; now you have all the flexibility you need:
library(markovchain)
library(igraph)  # for E()

statesNames <- c("a", "b", "c")
mc <- new("markovchain", states = statesNames,
          transitionMatrix = matrix(c(0.2, 0.5, 0.3,
                                      0,   1,   0,
                                      0.1, 0.8, 0.1),
                                    nrow = 3, byrow = TRUE,
                                    dimnames = list(statesNames, statesNames)))

g <- as(mc, "igraph")
min <- 0.5
plot(g, edge.label = ifelse(E(g)$prob >= min, E(g)$prob, NA))

How to generate medoid plots

Hi, I am using the partitioning around medoids algorithm for clustering, using the pam function in the cluster package. I have 4 attributes in the dataset that I clustered, and they seem to give me around 6 clusters. I want to generate a plot of these clusters across those 4 attributes, like this one
[1]: http://www.flickr.com/photos/52099123@N06/7036003411/in/photostream/lightbox/ "Centroid plot"
But the only way I can draw the clustering result is either using a dendrogram or using the
plot(data, col = result$clustering)
command, which seems to generate a plot similar to this
[2]: http://www.flickr.com/photos/52099123@N06/7036003777/in/photostream "pam results"
Although the first image is a centroid plot, I am wondering if there are any tools available in R to do the same with a medoid plot. Note that it also prints the size of each cluster in the plot. It would be great to know if there are any packages/solutions in R that facilitate this, or if not, what would be a good starting point for achieving plots similar to the one in Image 1.
Thanks
Hi all, I was trying to work out the problem the way joran told me, but I think I did not understand it correctly and have not done it the right way. Anyway, this is what I have done so far. The following is how the file that I tried to cluster looks:
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155
The following is the pam clustering output:
   GRMZM2G181227    GRMZM2G146885    GRMZM2G139463    GRMZM2G015295    GRMZM2G111909
               1                2                2                2                2
   GRMZM2G078097    GRMZM2G450498    GRMZM2G413652    GRMZM2G090087 AC217811.3_FG003
               3                3                2                2                2
Using the above two files, I generated a third file that looks like this and holds the cluster information in the form of a cluster type K1, K2, etc.:
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip Cluster_type
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722 K1
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648 K2
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344 K2
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755 K2
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883 K2
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945 K3
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794 K3
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949 K2
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155 K2
I certainly don't think that this is the file that joran would have wanted me to create, but I could not think of anything else, so I ran lattice on the above file using the following code:
library(lattice)  # for parallel()
clusres <- read.table("clusinput.txt", header = TRUE, sep = "\t")
jpeg(filename = "clusplot.jpeg", width = 800, height = 1078,
     pointsize = 12, quality = 100, bg = "white", res = 100)
parallel(~clusres[2:5] | Cluster_type, clusres, horizontal.axis = FALSE)
dev.off()
and I get a picture like this:
Since I want one single line as the representative of the whole cluster at four different points, this output is wrong. Moreover, I tried playing with lattice but cannot figure out how to make it accept the RPKM values as the X coordinate; it always seems to plot many lines against a maximum or minimum value on the Y coordinate, which I don't understand.
It would be great if anybody could help me out. Sorry if my question still seems absurd to you.
I do not know of any pre-built functions that generate the plot you indicate, which looks to me like a sort of parallel coordinates plot.
But generating such a plot would be a fairly trivial exercise.
1. Add a column of cluster labels (K1, K2, etc.) to your original data set, based on your clustering algorithm's output.
2. Use one of the many, many tools in R for aggregating data (plyr, aggregate, etc.) to calculate the relevant summary statistics by cluster on each of the four variables. (You haven't said what the first graph is actually plotting. Mean and sd? Median and MAD?)
3. Since you want the plots split into six separate panels, or facets, you will probably want to plot the data using either ggplot or lattice, both of which provide excellent support for creating the same plot, split across a single grouping vector (i.e. the clusters in your case). A sketch of steps 2 and 3 follows at the end of this answer.
But that's about as specific as anyone can get, given that you've provided so little information (i.e. no minimal runnable example, as recommended here).
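A minimal sketch of steps 2 and 3 (assuming the clusres data frame built above, with four RPKM columns and a Cluster_type column; the median is an arbitrary choice of summary statistic):
library(lattice)
# step 2: aggregate to one summary row (the median) per cluster
med <- aggregate(clusres[2:5],
                 by = list(Cluster_type = clusres$Cluster_type),
                 FUN = median)
# step 3: one line per cluster across the four variables
parallelplot(~med[2:5], groups = med$Cluster_type,
             horizontal.axis = FALSE, auto.key = list(space = "right"))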
How about using clusplot from the cluster package with partitioning around medoids? Here is a simple example (from the example section of the documentation):
require(cluster)
# generate 25 objects, divided into 2 clusters
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))
clusplot(pam(x, 2))  # pam() does the partitioning

Difference between two density plots

Is there a simple way to plot the difference between two probability density functions?
I can plot the pdfs of my data sets (both are one-dimensional vectors with roughly 11000 values) on the same plot to get an idea of the overlap/difference, but it would be more useful to see a plot of the difference itself.
Something along the lines of the following (though this obviously doesn't work):
> plot(density(data1)-density(data2))
I'm relatively new to R and have been unable to find what I'm looking for on any of the forums.
Thanks in advance
This should work:
rng <- range(c(data1, data2))
d1 <- density(data1, from = rng[1], to = rng[2])
d2 <- density(data2, from = rng[1], to = rng[2])
plot(x = d1$x, y = d1$y - d2$y)
The trick is to make sure the densities have the same limits; then you can plot their differences at the same locations. My understanding of the need for identical limits comes from having made the error of not taking that step when answering a similar question on R-help several years ago. Too bad I couldn't remember the right arguments.
It looks like you need to spend a little time learning how to use R (or any other language, for that matter). Help files are your friend.
From the output of ?density:
Value [i.e. the data returned by the function]
If give.Rkern is true, the number R(K), otherwise an object with class "density" whose underlying structure is a list containing the following components.
x    the n coordinates of the points where the density is estimated.
y    the estimated density values. These will be non-negative, but can be zero. [remainder of "Value" deleted for brevity]
So, do:
foo <- density(data1)
bar <- density(data2)
plot(foo$y - bar$y)
