Plotting Levenshtein distance scores in R

I'm trying to plot the Levenshtein distance scores between two lists of sequences (amino-acid sequences) using something other than a heatmap. Here is the code I used to generate a heatmap as an example:
library(utils)
library(pheatmap)
dist_scores <- adist(LV,   # first list of sequences
                     CD4,  # second list of sequences
                     counts = TRUE)
colors <- c("tomato", "khaki1", "darkseagreen2", "mediumseagreen", "gray30")
breaks <- c(0, 1, 2, 3, 4, 5)
pheatmap(dist_scores, breaks = breaks, color = colors,
         cluster_rows = TRUE, cluster_cols = TRUE)
and here is the heatmap from the example:
https://i.stack.imgur.com/4ay55.png
I want a more intuitive way of showing the data.
I'm thinking of plotting the data as nodes (representing the different sequences) and edges (representing the distances, where the edge length grows with the score), and also color-coding the nodes by whether they come from "LV" or "CD4". Is there a way to do this in R?
My coding skills are subpar at best so I would be really grateful for any help.
Thanks :)
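As a rough sketch of that node-and-edge idea, the following uses igraph with a force-directed layout; edge weights are inverted so that similar sequences are pulled closer together. The LV and CD4 vectors here are made-up placeholders for the real lists:

library(igraph)

# Hypothetical stand-ins for the two sequence lists
LV  <- c("CASSLGQGNTEAFF", "CASSPGTGGNEQFF", "CASSLAPGATNEKLFF")
CD4 <- c("CASSLGQGYEQYF", "CASSQDRGNQPQHF")

seqs <- c(LV, CD4)
d <- adist(seqs)  # pairwise Levenshtein distances among all sequences
g <- graph_from_adjacency_matrix(d, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
# Color nodes by which list each sequence came from
V(g)$color <- c(rep("tomato", length(LV)), rep("steelblue", length(CD4)))

# Fruchterman-Reingold treats larger weights as stronger attraction,
# so invert the distances to place similar sequences near each other
lay <- layout_with_fr(g, weights = 1 / (E(g)$weight + 1))
plot(g, layout = lay, vertex.label = seqs, vertex.label.cex = 0.7)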

IIRC from my previous experience doing bioinformatics decades ago, there are already good graphical representations for showing similarity between DNA sequences. One option I recall is the sequence dot plot, and R has at least two packages for producing them: seqinr (https://cran.r-project.org/web/packages/seqinr/index.html) and dotplot (https://github.com/evolvedmicrobe/dotplot). Another option, a web tool rather than an R package, is YASS (https://bioinfo.cristal.univ-lille.fr/yass/index.php).
For some alternative metrics and representations, see: https://ieeexplore.ieee.org/document/9097943 ("Levenshtein Distance, Sequence Comparison and Biological Database Search"), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4880953/ ("Similarity Estimation Between DNA Sequences Based on Local Pattern Histograms of Binary Images"), and https://pdf.sciencedirectassets.com/271876/1-s2.0-S0888613X07X01403/1-s2.0-S0888613X07000382/main.pdf ("Distance measures for biological sequences: Some recent approaches")
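For instance, a minimal dot-plot sketch with seqinr; the two short amino-acid sequences are arbitrary examples:

library(seqinr)

# s2c() splits a string into the character vector dotPlot() expects
s1 <- s2c("HEAGAWGHEE")
s2 <- s2c("PAWHEAE")

# Each dot marks a pair of positions where the residues match
dotPlot(s1, s2, xlab = "sequence 1", ylab = "sequence 2")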

Related

Interpret plot of multivariate time series clustering results from dtwclust

I'm using the dtwclust package in R for multivariate time series clustering. Here's my code:
data("uciCT")
mvc <- tsclust(CharTrajMV, k = 4L, distance = "gak", seed = 390L)
plot(mvc)
The CharTrajMV data set has 100 observations with 3 variables. As I understand it, clusters are determined based on all 3 variables, as opposed to univariate time series clustering.
Each cluster graph shows several similarly patterned time series (observations) belonging to that cluster. How is this graph drawn? There are 3 time series variables used for clustering, so how does a single pattern graph come out? I mean the input is a 3-dimensional (3-variable) dataset, but the output is 1-dimensional.
Moreover, I can get the 3 variables' centroids for each cluster (using mvc@centroids):
plot(mvc, labels = list(nudge_x = -10, nudge_y = 1), type="centroids")
This code shows only one centroid for each cluster. Can I get centroid graphs for all 3 variables of each cluster with a plot option? Or is this the right approach?
This is covered in the documentation. Plotting so many different series in separate panes would get very congested, so for multivariate plots the variables are appended one after the other, and you get vertical dotted lines to mark where that happened, possibly with some missing values injected in places to account for differences in length. This does mean the x axis isn't so meaningful anymore, but it's only meant as a quick visualization aid.
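If you do want one pane per variable, here is a minimal sketch, assuming each element of mvc@centroids is a time-by-variable matrix (dtwclust's multivariate representation):

# Plot each variable of cluster 1's centroid in its own pane
cen <- mvc@centroids[[1L]]  # matrix: one column per variable
par(mfrow = c(ncol(cen), 1L))
for (j in seq_len(ncol(cen))) {
  plot(cen[, j], type = "l", xlab = "time", ylab = paste("variable", j))
}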

Count the number of clusters on a UMAP scatterplot (without providing labels)?

I have a data frame with columns 'x' and 'y' corresponding to the x/y coordinates of a scatterplot I've made using ggplot2. I'm looking for some way to ask, "how many clusters exist here?". I understand that some user input may be required for what you want to call a 'cluster'.
I have found some success using Seurat, because it contains a function to label clusters. However, it's more like finding the clusters that correspond to a vector of labels provided by the user (e.g., I provide 5 unique labels, so it just goes and finds 5 clusters).
Minimal reprex:
Seurat's LabelClusters function is very useful for labeling clusters starting solely from X/Y coordinates:
library(umap)
library(Seurat)
library(ggplot2)
my_umap <- umap(iris[, 1:4])
my_umap <- as.data.frame(my_umap$layout)
my_umap$id <- iris$Species
colnames(my_umap) <- c("x", "y", "id")
p <- ggplot(my_umap, aes(x = x, y = y, color = id)) + geom_point()
LabelClusters(plot = p, id = 'id', color = "black")
Image of result
However, I need to detect the total number of clusters in these data (without providing labels). By this I mean first detecting how many clusters exist. Maybe here we would see 5 clusters instead of 3:
And here is that result
First of all, the number of clusters is very subjective. Different algorithms give different results, and it is up to the data analyst to decide what an appropriate clustering is. Some clusterings are better than others, but which algorithm is optimal for calling clusters depends on the data. For example, if you generate random noise, clustering algorithms will most likely still find clusters, but because the data is noise, these are meaningless.
I'm going to caution against calling clusters in UMAP space. UMAP is a nearest-neighbour embedding method that is constrained to 2/3 dimensions, and this constraint loses a lot of potentially useful information. I think a generally useful strategy is to compute a (shared) nearest-neighbour graph on the data (or on a number of principal components), through for example bluster::makeSNNGraph(), and then use igraph::cluster_louvain(), or some other method, to call clusters.
library(ggplot2)
library(umap)
data <- as.matrix(iris[, 1:4])
my_umap <- as.data.frame(umap(data)$layout)
graph <- bluster::makeSNNGraph(data)
#> Warning in (function (to_check, X, clust_centers, clust_info, dtype, nn, :
#> detected tied distances to neighbors, see ?'BiocNeighbors-ties'
clust <- igraph::cluster_louvain(graph)
my_umap$clust <- factor(clust$membership)
ggplot(my_umap, aes(V1, V2, colour = clust)) +
  geom_point()
Created on 2022-01-31 by the reprex package (v2.0.1)
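To then answer the "how many clusters?" question directly, the community count from the object above is simply:

# Number of clusters called by Louvain on the SNN graph
length(unique(clust$membership))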

rPhenograph output is a "large communities (68 elements, 2.5 MB)" object. How to plot it? ggplot2?

I used some raw output files from our flow cytometer, a .csv that tells me which intensities it measured at which wavelength for every event/cell.
This resulted in a .csv with around 25,000 cells and around 240 measuring points.
Importing the .csv file into RStudio and removing some measurements yielded a matrix of 25,000 obs. x 73 variables.
Then I used rPhenograph to calculate the neighborhoods, which worked well.
But now the result seems to be a data frame or something that I genuinely have no idea how to plot.
library(readr)
library(dplyr)
library(cytofkit)
Data1 <- read_csv("CD4_3.csv", skip = 17)
Data_selected <- select(Data1, ends_with(".A"))
rpheno_out <- Rphenograph(Data_selected)
I hoped to get a plot that looks like/resembles a tSNE plot.
Instead, I only got an error telling me that ggplot can't handle it.
ggplot(rpheno_out) + geom_point()
Error: data must be a data frame, or other object coercible by
  fortify(), not an S3 object with class communities
I think you've misunderstood what the Rphenograph() function returns; the doc states:
A simple R implementation of the PhenoGraph algorithm, which is a clustering method designed for high-dimensional single-cell data analysis. It works by creating a graph ("network") representing phenotypic similarities between cells by calculating the Jaccard coefficient between nearest-neighbor sets, and then identifying communities using the well-known Louvain method in this graph.
This only builds clusters on your data based on the information you provide; the output contains no dimensionality-reduced version of your input.
If you want to see what your clustering looks like, you have to apply your favorite dimension-reduction method and plot the result, color-coding by the cluster info from Rphenograph().
To give you an example, I've done this on the code provided in the function's doc:
library(cytofkit)
## Example from Rphenograph's doc
iris_unique <- unique(iris) # Remove duplicates
data <- as.matrix(iris_unique[, 1:4])
Rphenograph_out <- Rphenograph(data, k = 45)
## Added bit to see the results
pca <- prcomp(iris_unique[, 1:4], retx = TRUE, rank. = 2)
par(mfrow = c(1, 2))
plot(pca$x, col = Rphenograph_out$membership, lwd = 3,
     main = "Color by Rphenograph cluster")
plot(pca$x, col = iris_unique$Species, lwd = 3,
     main = "Color by Species")
Results in:

Graphing results of dbscan in R

Your comments, suggestions, or solutions will be greatly appreciated, thank you.
I'm using the fpc package in R to do a DBSCAN analysis of some very dense data (3 sets of 40,000 points each, in the range [-3, 6]).
I've found some clusters, and I need to graph just the significant ones. The problem is that I have a single cluster (the first) with about 39,000 points in it. I need to graph all the other clusters but this one.
dbscan() creates a special data type to store all of this cluster data. It's not indexed like a data frame would be (but maybe there is a way to represent it as one?).
I can graph the dbscan type using a basic plot() call. But, like I said, this will also graph the irrelevant 39,000 points.
tl;dr:
how do I graph only specific clusters of a dbscan data type?
If you look at the help page (?dbscan), it is organized like all others into sections labeled Description, Usage, Arguments, Details and Value. The Value section describes what the function dbscan returns. In this case it is simply a list (a standard R data type) with a few components.
The cluster component is simply an integer vector whose length is equal to the number of rows in your data, indicating which cluster each observation is a member of. So you can use this vector to subset your data to extract only the clusters you'd like, and then plot just those data points.
For example, if we use the first example from the help page:
library(fpc)
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
           runif(10, 0, 10) + rnorm(n, sd = 0.2))
ds <- dbscan(x, 0.2)
we can then use the result, ds, to plot only the points in clusters 1-3:
# Plot only clusters 1, 2 and 3
plot(x[ds$cluster %in% 1:3, ])
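To also tell the retained clusters apart, one small extension of the same example is to colour the points by membership:

# Same subset as above, coloured by cluster membership
sel <- ds$cluster %in% 1:3
plot(x[sel, ], col = ds$cluster[sel], pch = 20)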
Without knowing the specifics of dbscan, I can recommend that you look at the function smoothScatter. It is very useful for examining the main patterns in a scatterplot when you otherwise would have too many points to make sense of the data.
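For example, on the x matrix from the example above:

# Kernel-density shading instead of a mass of overplotted points
smoothScatter(x)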
Probably the most sensible way of plotting DBSCAN results is using alpha shapes, with the radius set to the epsilon value. Alpha shapes are closely related to convex hulls, but they are not necessarily convex. The alpha radius controls the amount of non-convexity allowed.
This is quite closely related to the DBSCAN cluster model of density-connected objects, and as such will give you a useful interpretation of the set.
As I'm not using R, I don't know about the alpha-shape capabilities of R. There supposedly is a package called alphahull, from a quick check on Google.
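A minimal sketch along those lines, assuming the fpc example data (x, ds) from the earlier answer and the alphahull package:

library(alphahull)

# Alpha shape of one cluster, with alpha set to the DBSCAN epsilon (0.2)
pts <- unique(x[ds$cluster == 2, ])  # ashape() does not allow duplicate points
plot(ashape(pts, alpha = 0.2))
points(pts, pch = 20, cex = 0.5)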

What techniques exists in R to visualize a "distance matrix"?

I wish to present a distance matrix in an article I am writing, and I am looking for a good visualization for it.
So far I came across balloon plots (I used them here, but I don't think they will work in this case), heatmaps (here is a nice example, but they don't allow presenting the numbers in the table -- correct me if I am wrong; maybe half the table in colors and half with numbers would be cool) and lastly correlation ellipse plots (here is some code and an example -- it is cool to use a shape, but I am not sure how to use it here).
There are also various clustering methods, but they will aggregate the data (which is not what I want), while what I want is to present all of the data.
Example data:
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist(nba[1:20, -1])
I am open to ideas.
You could also use force-directed graph drawing algorithms to visualize a distance matrix, e.g.
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist_m <- as.matrix(dist(nba[1:20, -1]))
dist_mi <- 1 / dist_m  # one over, as qgraph takes similarity matrices as input
library(qgraph)
jpeg('example_forcedraw.jpg', width = 1000, height = 1000, units = 'px')
qgraph(dist_mi, layout = 'spring', vsize = 3)
dev.off()
Tal, this is a quick way to overlay text on a heatmap. Note that this relies on image rather than heatmap, as the latter offsets the plot, making it more difficult to put text in the correct position.
To be honest, I think this graph shows too much information, making it a bit difficult to read... you may want to write only specific values.
Also, the other, quicker option is to save your graph as a PDF, import it into Inkscape (or similar software) and manually add the text where needed.
Hope this helps
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dst <- dist(nba[1:20, -1])
dst <- data.matrix(dst)
dim <- ncol(dst)
image(1:dim, 1:dim, dst, axes = FALSE, xlab = "", ylab = "")
axis(1, 1:dim, nba[1:20, 1], cex.axis = 0.5, las = 3)
axis(2, 1:dim, nba[1:20, 1], cex.axis = 0.5, las = 1)
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", dst), cex = 0.6)
A Voronoi diagram (a plot of a Voronoi decomposition) is one way to visually represent a distance matrix (DM).
They are also simple to create and plot using R--you can do both in a single line of R code.
If you're not familiar with this aspect of computational geometry, the relationship between the two (VD & DM) is straightforward, though a brief summary might be helpful.
Distance matrices--i.e., 2D matrices showing the distance between each point and every other point--are an intermediate output during kNN computation (i.e., k-nearest neighbors, a machine-learning algorithm which predicts the value of a given data point based on the weighted average value of its 'k' closest neighbors, distance-wise, where 'k' is some integer, usually between 3 and 5).
kNN is conceptually very simple--each data point in your training set is in essence a 'position' in some n-dimensional space, so the next step is to calculate the distance between each point and every other point using some distance metric (e.g., Euclidean, Manhattan, etc.). While the training step--i.e., constructing the distance matrix--is straightforward, using it to predict the value of new data points is practically encumbered by the data retrieval--finding the closest 3 or 4 points from among several thousand or several million scattered in n-dimensional space.
Two data structures are commonly used to address that problem: kd-trees and Voronoi decompositions (aka "Dirichlet tessellation").
A Voronoi decomposition (VD) is uniquely determined by a distance matrix--i.e., there's a 1:1 map; so indeed it is a visual representation of the distance matrix, although again, that's not their purpose--their primary purpose is the efficient storage of the data used for kNN-based prediction.
Beyond that, whether it's a good idea to represent a distance matrix this way probably depends most of all on your audience. To most, the relationship between a VD and the antecedent distance matrix will not be intuitive. But that doesn't make it incorrect--if someone without any statistics training wanted to know whether two populations had similar probability distributions and you showed them a Q-Q plot, they would probably think you hadn't engaged their question. So for those who know what they are looking at, a VD is a compact, complete, and accurate representation of a DM.
So how do you make one?
A Voronoi decomp is constructed by selecting (usually at random) a subset of points from within the training set (this number varies by circumstances, but if we had 1,000,000 points, then 100 is a reasonable number for this subset). These 100 data points are the Voronoi centers ("VCs").
The basic idea behind a Voronoi decomp is that rather than having to sift through the 1,000,000 data points to find the nearest neighbors, you only have to look at these 100; once you find the closest VC, your search for the actual nearest neighbors is restricted to just the points within that Voronoi cell. Next, for each data point in the training set, calculate the VC it is closest to. Finally, for each VC and its associated points, calculate the convex hull--conceptually, just the outer boundary formed by that VC's assigned points that are farthest from the VC. This convex hull around the Voronoi center forms a "Voronoi cell." A complete VD is the result of applying those three steps to each VC in your training set, which gives you a perfect tessellation of the surface (see the diagram below).
To calculate a VD in R, use the tripack package. The key function is voronoi.mosaic, to which you just pass the x and y coordinates separately--the raw data, not the DM--and then you can pass the result to plot:
library(tripack)
plot(voronoi.mosaic(runif(100), runif(100), duplicate="remove"))
You may want to consider looking at a 2-D projection of your matrix (multidimensional scaling). Here is a link to how to do it in R.
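As a minimal sketch of that idea (my own illustration, using base R's cmdscale on the example data rather than the linked page's code):

# Classical (metric) MDS of the example distance matrix, projected to 2-D
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
fit <- cmdscale(dist(nba[1:20, -1]), k = 2)
plot(fit, type = "n", xlab = "dimension 1", ylab = "dimension 2")
text(fit, labels = nba$Name[1:20], cex = 0.7)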
Otherwise, I think you are on the right track with heatmaps. You can add in your numbers without too much difficulty. For example, building off of Learn R:
library(ggplot2)
library(plyr)
library(arm)
library(reshape2)
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
nba$Name <- with(nba, reorder(Name, PTS))
nba.m <- melt(nba)
nba.m <- ddply(nba.m, .(variable), transform,
               rescale = rescale(value))
(p <- ggplot(nba.m, aes(variable, Name)) +
    geom_tile(aes(fill = rescale), colour = "white") +
    scale_fill_gradient(low = "white", high = "steelblue") +
    geom_text(aes(label = round(rescale, 1))))
A dendrogram based on a hierarchical cluster analysis can be useful:
http://www.statmethods.net/advstats/cluster.html
A 2-D or 3-D multidimensional scaling analysis in R:
http://www.statmethods.net/advstats/mds.html
If you want to go into 3+ dimensions, you might want to explore ggobi / rggobi:
http://www.ggobi.org/rggobi/
In the book "Numerical Ecology with R" by Borcard et al. (2011), they use a function called coldiss.R.
You can find it here: http://ichthyology.usm.edu/courses/multivariate/coldiss.R
It color-codes the distances and even orders the records by dissimilarity.
Another good package would be the seriation package (see the sketch after the reference below).
Reference:
Borcard, D., Gillet, F. & Legendre, P. (2011) Numerical Ecology with R. Springer.
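A minimal sketch of that seriation route on the example data (my own illustration):

library(seriation)

# Reorder the 20 records by dissimilarity and shade the pairwise distances
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
d <- dist(nba[1:20, -1])
dissplot(d)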
A solution using multidimensional scaling:
data <- read.csv("http://datasets.flowingdata.com/ppg2008.csv", sep = ",")
# Squared Euclidean distances from the Gram matrix, via the identity
# d_ij^2 = <x_i, x_i> + <x_j, x_j> - 2 * <x_i, x_j>
dst <- tcrossprod(as.matrix(data[, -1]))
dst <- matrix(rep(diag(dst), 50L), ncol = 50L, byrow = TRUE) +
       matrix(rep(diag(dst), 50L), ncol = 50L, byrow = FALSE) - 2 * dst
library(MASS)
mds <- isoMDS(dst)
# remove {type = "n"} to see dots
plot(mds$points, type = "n", pch = 20, cex = 3,
     col = adjustcolor("black", alpha = 0.3), xlab = "X", ylab = "Y")
text(mds$points, labels = rownames(data), cex = 0.75)
