Get clusters from PCA in R

I have a PCA that shows two really big clusters, and I don't know how to figure out which of my samples are in each cluster.
If it helps, I'm using prcomp to generate the PCA:
pca1 <- autoplot(prcomp(df), label = TRUE, label.size = 2)
My approach has been to attempt to cluster the PCA output using kmeans with 2 groups to get the clusters:
pca <- prcomp(df, scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=2)$cluster
I can then make a beautiful plot, but I am still lost as to which samples are in each cluster. For reference, here is the plot generated when I graph the kmeans output:
As you can see in the first PCA plot, the labels literally say which sample each dot is. My ideal output would be a two column txt file with the sample name in one column, and the group it belongs to in the other column.
All that aside, if there is a better way, please let me know.
Thanks in advance.
Here is a chunk of my data:
a b c b e
Sample_1013 312011 624559 625898 534309 220415
Sample_1046 474774 949458 951145 843049 366136
Sample_104 645363 1290450 1292520 919474 272200
Sample_1057 267319 534685 535294 690574 422645
Sample_106 414065 830571 834527 657354 234130
Sample_107 299289 602483 603756 566256 262153

In my question, clust is the name of the output from my kmeans:
clust <- kmeans(pca$x[,1:2], centers=2)$cluster
I typed clust into the terminal and got which samples belong to each group:
> clust
Sample_1013 Sample_1046 Sample_104 Sample_1057 Sample_106 Sample_107
1 1 1 1 1 1
Sample_1098 Sample_109 Sample_1109 Sample_1129 Sample_1130 Sample_1140
1 1 1 1 1 1
Sample_1149 Sample_115 Sample_118 Sample_1220 Sample_1223 Sample_1225
1 1 1 1 1 1
Hopefully this helps someone.
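For the two-column text file asked for in the question, something along these lines might work. It is only a sketch: the toy df, its values, and the output path are made up so the example runs on its own, and the real df would be used instead.

```r
# Toy stand-in for the asker's `df`: rows are samples, columns are measurements.
# The values are made up purely so the example is self-contained.
set.seed(1)
df <- data.frame(
  a = c(rnorm(5, mean = 10), rnorm(5, mean = 50)),
  b = c(rnorm(5, mean = 20), rnorm(5, mean = 80)),
  c = c(rnorm(5, mean = 30), rnorm(5, mean = 90))
)
rownames(df) <- paste0("Sample_", 1:10)

pca   <- prcomp(df, scale. = TRUE)
clust <- kmeans(pca$x[, 1:2], centers = 2)$cluster

# kmeans() carries the row names of its input over as names of the cluster
# vector, so sample names and group ids can be paired directly.
groups <- data.frame(sample = names(clust), group = unname(clust))

# Hypothetical output path; replace with wherever the file should go.
out_file <- file.path(tempdir(), "pca_clusters.txt")
write.table(groups, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Sample names per group, as a list
split(groups$sample, groups$group)
```

Note that kmeans starts from random centers, so the group numbers (1 vs 2) can swap between runs unless a seed is set.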

Related

Color the individuals of a R PCoA plot by groups

Should be a simple question, but I haven't found exactly how to do it so far.
I have a matrix as follow:
sample var1 var2 var3 etc.
1 5 7 3 1
2 0 1 6 8
3 7 6 8 9
4 5 3 2 4
I performed a PCoA using Vegan and plotted the results. Now my problem is that I want to color the samples according to a pre-defined group:
group sample
1 1
1 2
2 3
2 4
How can I import the groups and then plot the points colored according to the group they belong to? It looks simple, but I have been scratching my head over this.
Thanks!
Seb
You said you used vegan PCoA, which I assume means the wcmdscale function. By default, vegan::wcmdscale returns only a scores matrix, much like stats::cmdscale, but if you add certain arguments (such as eig = TRUE) you get a full wcmdscale result object with dedicated plot and points methods, and you can do:
plot(<pcoa-result>, type="n") # no reproducible example: edit like needed
points(<pcoa-result>, col = group) # no reproducible example: group must be visible
If you have a modern vegan (2.5.x) the following also works:
library(magrittr)
plot(<full-pcoa-result>, type = "n") %>% points("sites", col = group)
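Since there is no reproducible example, here is a self-contained sketch of the same idea using only base R. It swaps in stats::cmdscale on Euclidean distances instead of vegan, and the column name var4, the colors, and the in-line group table are assumptions for illustration (in practice the groups would be read from a file with read.table or similar).

```r
# Data laid out as in the question; "var4" stands in for the "etc." column.
dat <- data.frame(
  var1 = c(5, 0, 7, 5),
  var2 = c(7, 1, 6, 3),
  var3 = c(3, 6, 8, 2),
  var4 = c(1, 8, 9, 4),
  row.names = c("1", "2", "3", "4")
)
groups <- data.frame(group = c(1, 1, 2, 2), sample = c("1", "2", "3", "4"))

# PCoA on Euclidean distances; cmdscale() returns one row of scores per sample
scores <- cmdscale(dist(dat), k = 2)

# Align the group table to the row order of the scores before coloring,
# so the colors stay correct even if the two tables are sorted differently.
grp <- factor(groups$group[match(rownames(scores), groups$sample)])
pal <- c("steelblue", "tomato")  # one (arbitrary) color per group level

plot(scores, col = pal[grp], pch = 19, xlab = "PCoA 1", ylab = "PCoA 2")
legend("topright", legend = levels(grp), col = pal, pch = 19, title = "group")
```

The match() step is the important part: the col vector has to be in the same order as the plotted points.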

Thin Plate Spline for 3D surface prediction in R

I tried this answer
get a surface plot in R
but it hasn't really helped.
I would like to perform a TPS (using Tps from the fields package) on an XYZ data frame where x and y are coordinates and z is a thickness. Then I would like to visualise the surface, first before the TPS and then after it. Is this possible?
Then I would like to extract predicted thicknesses for a given set of new xy coordinates.
Please let me know if this is possible
My Dataframe looks like this, dataframe is called LSP:
time PART MEAS PARTSUB XLOC YLOC
xxxx 1 1.956 a -3465 -94350
xxxx 1 1.962 a -3465 -53850
xxxx 1 1.951 a 50435 -40350
xxxx 1 1.958 a -57365 -40350
So I tried this:
LSP.spline <- Tps(LSP[,5:6], LSP$MEAS)
out.p <- predict.surface(LSP.spline, xy = c(1,2))
plot.surface(out.p, type="p")
But out.p is just NULL..?
so attempting the plot gives me:
Error in nrow(z) : argument "z" is missing, with no default
Any help is appreciated.
Paul.
predict.surface is now an obsolete / deprecated function. Use predictSurface instead.
library(fields) # provides Tps, predictSurface, surface and the BD example data
fit <- Tps(BD[, 1:4], BD$lnya) # fit surface to data
# evaluate fitted surface for first two
# variables holding other two fixed at median values
out.p <- predictSurface(fit)
surface(out.p, type = "C")
Thanks for that. How about my second question: how can I extract predicted surface thickness values for a given set of XY locations?
Use predict function. Have a read on ?predict.Tps. For the above example, let's say we want to predict at the first 4 locations in BD[, 1:4], we can do
predict(fit, x = BD[1:4, 1:4])
# [,1]
#[1,] 11.804124
#[2,] 11.804124
#[3,] 8.069056
#[4,] 9.501551
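In general, pass x a matrix with one column per predictor used in the fit; for an XLOC/YLOC fit like yours, that is a two-column matrix of new coordinates.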

display clusters in radial format

I have a list of clusters, let's say from cluster 1 to cluster 3, along with
their membership (example below). I would like to display the clusters in radial format. I was thinking of using the as.phylo function
in the ape package to display this, but that requires creating an hclust object. If anyone knows how to do this, whether by creating an hclust object or otherwise, that would be much appreciated.
Many Thanks!
cl var numberOfCluster
1 a 1
1 b 1
1 c 1
1 d 1
1 a 2
1 b 2
2 c 2
2 d 2
3 a 3
1 b 3
2 c 3
2 d 3
Thanks very much!
(This is a copy of my answer to a similar question from "crossvalidated")
Assuming you can create hclust (from variables which can have a distance measure defined on them) - then it can be done by combining two new packages: circlize and dendextend.
The plot can be made using the circlize_dendrogram function (allowing for a much more refined control over the "fan" layout of the plot.phylo function).
# install.packages("dendextend")
# install.packages("circlize")
library(dendextend)
library(circlize)
# create a dendrogram
hc <- hclust(dist(datasets::mtcars))
dend <- as.dendrogram(hc)
# modify the dendrogram to have some colors in the branches and labels
dend <- dend %>%
color_branches(k=4) %>%
color_labels
# plot the radial plot
par(mar = rep(0,4))
# circlize_dendrogram(dend, dend_track_height = 0.8)
circlize_dendrogram(dend, labels_track_height = NA, dend_track_height = .4)
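If all you have is the membership table from the question, one possible way to get an hclust object is to define a distance between items yourself. The sketch below assumes (this is my choice, not something from the question) that the distance between two items is the share of clustering solutions in which they land in different clusters, and uses average linkage; both are arbitrary.

```r
# Membership table copied from the question
memb <- data.frame(
  cl  = c(1, 1, 1, 1,  1, 1, 2, 2,  3, 1, 2, 2),
  var = rep(c("a", "b", "c", "d"), times = 3),
  numberOfCluster = rep(1:3, each = 4)
)

# Wide membership matrix: one row per item, one column per clustering solution
wide <- with(memb, tapply(cl, list(var, numberOfCluster), FUN = function(x) x[1]))

# Assumed dissimilarity: fraction of solutions where two items are separated
items <- rownames(wide)
d <- outer(seq_along(items), seq_along(items),
           Vectorize(function(i, j) mean(wide[i, ] != wide[j, ])))
dimnames(d) <- list(items, items)

# hclust object built from that dissimilarity (linkage method is arbitrary)
hc <- hclust(as.dist(d), method = "average")
plot(hc)
```

The resulting hc can then be passed through as.dendrogram(hc) and on to circlize_dendrogram as in the code above.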

How to randomly select 2 vertices from a graph in R?

I'm new to R, and I'm trying to randomly select 2 vertices from a graph.
What I've done so far is:
First, set up a graph
edgePath <- "./project1/data/smalledges.csv"
edgesMatrix <- as.matrix(read.csv(edgePath, header = TRUE, colClasses = "character"))
graph <- graph.edgelist(edgesMatrix)
The smalledges.csv file looks like this:
from to
4327231 2587908
Then I get all the vertices from the graph into a list:
vList <- as.list(get.data.frame(graph, what = c("vertices")))
After that, I try to use:
sample(vList, 2)
But what I've got is an error:
cannot take a sample larger than the population when 'replace = FALSE'
I guess it's because R thinks what I want is 2 random lists, so I tried this:
sample(vList, 2, replace = TRUE)
And then I've got 2 large lists... BUT THAT'S NOT WHAT I WANTED! So guys, how can I randomly select 2 vertices from my graph? Thanks!
Not clear from your question whether you want just the vertices, or a sub-graph containing those vertices. Here's an example of both.
library(igraph)
set.seed(1) # for reproducible example
g <- erdos.renyi.game(10, 0.3)
par(mfrow=c(1,3), mar=c(1,1,1,1))
set.seed(1) # for reproducible plot
plot(g)
# random sample of vertices
smpl <- sample(1:vcount(g),5)
V(g)[smpl] # 5 random vertices
# Vertex sequence:
# [1] 9 5 7 2 4
# change the color of only those vertices
V(g)[smpl]$color="lightgreen" # make them light green
set.seed(1) # for reproducible plot
plot(g)
# create a sub-graph with only those vertices, retaining edge structure
sub.g <- induced.subgraph(g,V(g)[smpl])
plot(sub.g)
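As for the error in the question: get.data.frame(graph, what = "vertices") returns a data frame with a single name column, so as.list() gives a list of length 1 and sample() cannot draw 2 elements from it. If only the vertex names are needed, a base R sketch like the following might be enough. The edge list here is a made-up stand-in for smalledges.csv (only the first row comes from the question).

```r
# Hypothetical edge list; ids other than the first row are invented
edgesMatrix <- matrix(
  c("4327231", "2587908",
    "4327231", "1000001",
    "1000001", "1000002"),
  ncol = 2, byrow = TRUE, dimnames = list(NULL, c("from", "to"))
)

# A graph built from an edge list has no isolated vertices,
# so the unique ids in the edge list are exactly the vertex set.
vertexIds <- unique(as.vector(edgesMatrix))

set.seed(1)  # only for a reproducible example
picked <- sample(vertexIds, 2)  # 2 distinct vertex names
picked
```

With igraph loaded and the graph built as in the question, sample(V(graph)$name, 2) should give the same kind of result directly from the graph object.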

output clustered similarity matrix

I have generated a pearson similarity matrix and plotted the results using pheatmap (clustered using hclust, method = "complete"). I'd like to output the ordered matrix, but in R the default seems to be just to alphabetize everything.
Here is my code:
df <- cor(t(genes), method = "pearson")
pheatmap(df, clustering_method = "complete")
head(genes)
pre early mid late end
AAC1 2.0059007 3.64679740 3.0092533 2.4936171 2.2693034
AAC3 -1.6843969 -1.62572636 -0.7654462 -1.5827481 -1.6059080
AAD10 2.6012529 2.05759631 1.3665322 1.4590833 0.3778324
AAD14 0.5047704 0.76021375 0.1825944 0.6111774 0.1174208
AAD15 7.6017557 8.52315453 7.2605744 6.9029452 5.9028824
AAD16 1.2018193 -0.03285354 0.2229450 -0.1337033 0.2198542
This is what the current output (df) looks like:
A B C D
A 1 0.5 0.25 0.1
B 0.1 1 0.1 0.5
C 0.5 0.2 1 0.2
D 0 0.1 0.7 1
How can I output the similarity matrix as ordered by hclust?
I've looked, but I haven't been able to find anything that quite accomplishes what I need. Thanks in advance for your help!
(also sorry I don't know how to properly format everything yet)
EDIT: maybe some visuals would help. My clustered pheatmap output looks like this: (image: ordered heatmap)
I can see groups of genes that behave similarly, but because there are so many it's impossible/useless to read the labels. I want to find out which genes cluster together, but I can't output the ordered matrix.
When I plot the data without clustering it looks like this: (image: unclustered heatmap)
So the output/data I can get is pretty much useless for further analysis.
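One way to get the ordered matrix might be to redo the clustering with hclust and reorder the matrix by its order component. This is a sketch under assumptions: the toy genes matrix and the output path are made up, it assumes pheatmap is clustering rows on Euclidean distance (which, as far as I know, is its default), and k = 2 in cutree is arbitrary.

```r
# Toy expression matrix standing in for `genes` (made-up values)
set.seed(1)
genes <- matrix(rnorm(6 * 5), nrow = 6,
                dimnames = list(paste0("gene", 1:6),
                                c("pre", "early", "mid", "late", "end")))
df <- cor(t(genes), method = "pearson")

# Cluster the rows of the similarity matrix: Euclidean distance, complete linkage.
# df is symmetric, so rows and columns get the same ordering.
hc <- hclust(dist(df), method = "complete")

# Reorder rows and columns by the dendrogram's leaf order and save
ordered <- df[hc$order, hc$order]
write.csv(ordered, file.path(tempdir(), "ordered_similarity.csv"))

# Which genes cluster together: cut the tree into k groups (k is arbitrary here)
clusters <- cutree(hc, k = 2)
split(names(clusters), clusters)
```

If you assign the heatmap to a variable, e.g. res <- pheatmap(df, clustering_method = "complete"), the returned object should also contain the row tree as res$tree_row, so res$tree_row$order can be used the same way as hc$order.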
