Get multiple polygons for scattered data in R - r

I have point cloud data of an area (x,y,z coordinates)
The plot of X and Y looks like:
I am trying to get polygons of different clusters in this data. I tried the following:
points <- df [,1:2] # x and y coordinates
pts <- st_as_sf(points, coords=c('X','Y'))
conc <- concaveman(pts, concavity = 0.5, length_threshold = 0)
Seems like I just get a single polygon binding the whole data. conc$polygons is a list of one variable.
How can I define multiple polygons? What am I missing when I am using concaveman and what all it can provide?

It's hard to tell from your example what variable defines your clusters. Below is an example with some simulated clusters using ggplot2 and data.table (adapted from here).
library(data.table)
library(ggplot2)
# Simulate data:
set.seed(1)
n_cluster = 50
centroids = cbind.data.frame(
x=rnorm(5, mean = 0, sd=5),
y=rnorm(5, mean = 0, sd=5)
)
dt = rbindlist(
lapply(
1:nrow(centroids),
function(i) {
cluster_dt = data.table(
x = rnorm(n_cluster, mean = centroids$x[i]),
y = rnorm(n_cluster, mean = centroids$y[i]),
cluster = i
)
}
)
)
dt[,cluster:=as.factor(cluster)]
# Find convex hull of each point by cluster:
hulls = dt[,.SD[chull(x,y)],by=.(cluster)]
# Plot:
p = ggplot(data = dt, aes(x=x, y=y, colour=cluster)) +
geom_point() +
geom_polygon(data = hulls,aes(fill=cluster,alpha = 0.5)) +
guides(alpha=F)
This produces the following output:
Edit
If you don't have predefined clusters, you can use a clustering algorithm. As a simple example, see below for a solution using kmeans with 5 centroids.
# Estimate clusters (e.g. kmeans):
dt[,km_cluster := as.factor(kmeans(.SD,5)$cluster),.SDcols=c("x","y")]
# Find convex hull of each point:
hulls = dt[,.SD[chull(x,y)],by=.(km_cluster)]
# Plot:
p = ggplot(data = dt, aes(x=x, y=y, colour=km_cluster)) +
geom_point() +
geom_polygon(data = hulls,aes(fill=km_cluster,alpha = 0.5)) +
guides(alpha=F)
In this case the output for the estimated clusters is almost equivalent to the constructed ones.

Related

T-SNE code text labelling of the clusters

Im using this code for running t-sne .
I want to do the t-sne on my whole data frame
So is there way to label my points that are being clustered and as well as label them with different colours to make them visually differentiable .
These are my samples CMP_6792" "CMP_7256" "CMP_7653" "GMP_6792" "GMP_7256" "GMP_7653" "HSC_6792"
"HSC_7256" "HSC_7653" "Mono_6792" "Mono_7256" "Mono_7653" "Gran1" "Gran2
I would like to label my points according to the above mentioned sample.
Here is my code
file1<- read.csv('PRIMARY_CELL_EPILIST.csv')
head(file1)
names(file1)
class(file1)
dat <- data.frame(file1)
rownames(file1) <- make.names(file1[,1], unique = TRUE)
head(file1)
dim(file1)
data <- file1[,2:15]
head(data)
library(tsne)
tsne1 <- tsne(scale(data), perplexity = 10,max_iter = 300)
plot(tsne1[, 1], tsne1[, 2])
library(ggplot2)
plotdata <- data.frame(tsne_x = tsne1[, 1], tsne_y = tsne1[, 2])
plt1 <- ggplot(plotdata) + geom_point(aes(x = tsne_x, y = tsne_y))
plot(plt1)
So any help or suggestion as well as improvement over my code would be highly appreciated .
You will first want to cluster your t-SNE results. The cluster assignments will then serve as color assignment.
cl <- cluster::pam( tsne1 )
Modify your plotdata data.frame so that it includes everything (sample names, t-SNE coordinates, cluster assignments):
plotdata <- data.frame( tsne_x = tsne1[,1], tsne_y = tsne1[,2], SampleID = v,
Cluster = cl$clustering )
where v is the vector of sample names you provided (i.e., v <- c( "CMP_6792", "CMP_7256", "CMP_7653", ... ), or v <- rownames(tsne1) if it's available).
Finally, adjust your ggplot call to access the relevant columns in the data.frame:
plt1 <- ggplot( plotdata, aes( x = tsne_x, y = tsne_y, color = Cluster ) +
geom_point() + ggrepel::geom_text_repel( aes( label = SampleID ) )

created a nested cdf that doesn't reach 1

Here is some workable example of data I wish to plot:
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- round(x, digits = 0)
x <- sort(x, decreasing = FALSE)
y <- c(rep(0.1, times = 500),rep(0.25, times = 500),rep(0.4, times = 500),rep(0.85, times = 500))
z <- rbinom(n=2000, size=1, prob=y)
df1 <- data.frame(x,z)
I want to plot the overal fequency of z across x.
unlike a typical cdf, the function should not reach 1.0, but instead
sum(df1$z)/length(df1$z)
a ymax of 0.36 (721/2000).
using ggplot2 we can create a cdf of x with the following command:
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
But i want to extend this plot to show the cumulative percentage of z (as a function of 'x')
The end result should like like
EDIT
with some very poor data manipulation I am able to generate the something similiar to a cdf plot, but there must be a more beautiful and easy method using various packages and ggplot
mytable <- table(df1$x, df1$z)
mydf <- as.data.frame.matrix(mytable)
colnames(mydf) <- c("z_no", "z_yes")
mydf$A <- 1:length(mydf$z_no)
mydf$sum <- cumsum(mydf$z_yes)
mydf$dis <- mydf$sum/length(z)
plot(mydf$A, mydf$dis)
You can use the package dplyr to process the data as follows:
library(dplyr)
plot_data <- group_by(df1, x) %>%
summarise(z_num = sum(z)) %>%
mutate(cum_perc_z = cumsum(z_num)/nrow(df1))
This gives the same result as the data processing that you describe in your edit. Note, however, that I get sum(df1$z) = 796 and the maximal y value is thus 796/2000 = 0.398.
For the plot, you can use geom_step() to have a step function and add the horizontal line with geom_hline():
ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
geom_step(colour = "red", size = 0.8) +
geom_hline(yintercept = max(plot_data$cum_perc_z))

R Subset of pam, Arrange multiple figures in one

I'm struggling with the following problem:
I use pam to cluster my dataset v in 7 clusters:
x <- pam(v,7)
I know that there is a vector clustering in x which contains the according numbers of clusters.
I would like to get a subset of x which only contains cluster 1.
Is this possible?
Edit:
Here is an example. Cluster iris in three clusters and plot them.
library(ggfortify)
library(cluster)
v <- iris[-5]
x <- pam(v,3)
autoplot(x, frame = TRUE, frame.type = 'norm')
The question: How can I plot only the first cluster? It should look like the first plot without cluster 2 and 3.
Edit: I think I found a solution. Therefore I don't use autoplot anymore but calculate the convex hull of every cluster and plot it.
library(cluster)
library(plyr)
library(ggplot2)
library(ggrepel)
find_hull <- function(df) df[chull(df$x, df$y),]
v<-iris[-5]
pp <- pam(v,3)
n<-princomp(pp$data, scores = TRUE, cor = ncol(pp$data) != 2)$scores
df<-data.frame(n[,1],n[,2],pp$clustering)
colnames(df)<-c("x","y","z")
hulls <- ddply(df, "z", find_hull)
p<-qplot(x,y,data=df,color=as.factor(z))+
geom_polygon(data=hulls, alpha=1, fill=NA)+
geom_text_repel(aes(label = rownames(df)),arrow = arrow(length = unit(0.00, 'inches'), angle = 0.00),size=5.5,colour="grey55")+
theme_classic(base_size = 16)+
theme(axis.line=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),axis.title.y=element_blank(),legend.position="none",
panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),plot.background=element_blank())
p
df2<-df[df$z==1,]
hulls <- ddply(df2, "z", find_hull)
p1<-qplot(x,y,data=df2,color=as.factor(z))+
geom_polygon(data=hulls, alpha=0.8, fill=NA)+
geom_text_repel(aes(label = rownames(df2)),arrow = arrow(length = unit(0.00, 'inches'), angle = 0.00),size=5.5,colour="grey25")+
theme_classic(base_size = 16)+
theme(axis.line=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),axis.title.y=element_blank(),legend.position="none",
panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),plot.background=element_blank())+
p1
Now I want to plot both figures in one device. I have already tried the multiplot from cookbook-r but it gives the error
Error: Aesthetics must be either length 1 or the same as the data (26): label, x, y
It must be because of the labels I guess.
I also tried
grid.arrange(p,p1, ncol=1)
from the gridExtra package but it gives the same error.
Is there any other option to arrange multiple figures with labels in one figure?

How to map a sphere of influence in R

I don't have any code yet because I am trying to figure out where to begin.
I am using map('state, 'texas) to draw Texas and am geoplotting universities on it. I want R to figure out the sphere of influence that university has with in the state and map it out.
Eventually I will geoplot high schools on the map as well and I would like for R to see who's sphere of influence that high school is in.
Does anyone know what package to begin with?
Your description matches with the concept of a voronoi diagram. It partitions an area into polygons based on the locations of points (e.g. your high schools). All the points in the polygon are closer to that particular high school than to all other high schools.
An example using ggplot2, copied from this link:
library(ggplot2)
library(deldir)
library(scales)
library(reshape2)
library(plyr)
# make fake points
n <- 50
k <- 4
mat <- cbind(rnorm(n), rnorm(n))
df <- as.data.frame(mat)
names(df) <- c('x','y')
# triangulate
xrng <- expand_range(range(df$x), .05)
yrng <- expand_range(range(df$y), .05)
deldir <- deldir(df, rw = c(xrng, yrng))
# voronoi
qplot(x, y, data = df) +
geom_segment(
aes(x = x1, y = y1, xend = x2, yend = y2), size = .25,
data = deldir$dirsgs, linetype = 2
) +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0))

PCA FactoMineR plot data

I'm running an R script generating plots of the PCA analysis using FactorMineR.
I'd like to output the coordinates for the generated PCA plots but I'm having trouble finding the right coordinates. I found results1$ind$coord and results1$var$coord but neither look like the default plot.
I found
http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf
and
http://factominer.free.fr/classical-methods/principal-components-analysis.html
but neither describe the contents of the variable created by the PCA
library(FactoMineR)
data1 <- read.table(file=args[1], sep='\t', header=T, row.names=1)
result1 <- PCA(data1,ncp = 4, graph=TRUE) # graphs generated automatically
plot(result1)
I found that $ind$coord[,1] and $ind$coord[,2] are the first two pca coords in the PCA object. Here's a worked example that includes a few other things you might want to do with the PCA output...
# Plotting the output of FactoMineR's PCA using ggplot2
#
# load libraries
library(FactoMineR)
library(ggplot2)
library(scales)
library(grid)
library(plyr)
library(gridExtra)
#
# start with a clean slate
rm(list=ls(all=TRUE))
#
# load example data
data(decathlon)
#
# compute PCA
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13, graph = FALSE)
#
# extract some parts for plotting
PC1 <- res.pca$ind$coord[,1]
PC2 <- res.pca$ind$coord[,2]
labs <- rownames(res.pca$ind$coord)
PCs <- data.frame(cbind(PC1,PC2))
rownames(PCs) <- labs
#
# Just showing the individual samples...
ggplot(PCs, aes(PC1,PC2, label=rownames(PCs))) +
geom_text()
# Now get supplementary categorical variables
cPC1 <- res.pca$quali.sup$coor[,1]
cPC2 <- res.pca$quali.sup$coor[,2]
clabs <- rownames(res.pca$quali.sup$coor)
cPCs <- data.frame(cbind(cPC1,cPC2))
rownames(cPCs) <- clabs
colnames(cPCs) <- colnames(PCs)
#
# Put samples and categorical variables (ie. grouping
# of samples) all together
p <- ggplot() + theme(aspect.ratio=1) + theme_bw(base_size = 20)
# no data so there's nothing to plot...
# add on data
p <- p + geom_text(data=PCs, aes(x=PC1,y=PC2,label=rownames(PCs)), size=4)
p <- p + geom_text(data=cPCs, aes(x=cPC1,y=cPC2,label=rownames(cPCs)),size=10)
p # show plot with both layers
# Now extract the variables
#
vPC1 <- res.pca$var$coord[,1]
vPC2 <- res.pca$var$coord[,2]
vlabs <- rownames(res.pca$var$coord)
vPCs <- data.frame(cbind(vPC1,vPC2))
rownames(vPCs) <- vlabs
colnames(vPCs) <- colnames(PCs)
#
# and plot them
#
pv <- ggplot() + theme(aspect.ratio=1) + theme_bw(base_size = 20)
# no data so there's nothing to plot
# put a faint circle there, as is customary
angle <- seq(-pi, pi, length = 50)
df <- data.frame(x = sin(angle), y = cos(angle))
pv <- pv + geom_path(aes(x, y), data = df, colour="grey70")
#
# add on arrows and variable labels
pv <- pv + geom_text(data=vPCs, aes(x=vPC1,y=vPC2,label=rownames(vPCs)), size=4) + xlab("PC1") + ylab("PC2")
pv <- pv + geom_segment(data=vPCs, aes(x = 0, y = 0, xend = vPC1*0.9, yend = vPC2*0.9), arrow = arrow(length = unit(1/2, 'picas')), color = "grey30")
pv # show plot
# Now put them side by side in a single image
#
grid.arrange(p,pv,nrow=1)
#
# Now they can be saved or exported...
Adding something extra to Ben's answer. You'll note in the first chart in Ben's response that the labels overlap somewhat. The pointLabel() function in the maptools package attempts to find locations for the labels without overlap. It's not perfect, but you can adjust the positions in the new dataframe (see below) to fine tune if you want. (Also, when you load maptools you get a note about gpclibPermit(). You can ignore it if you're concerned about the restricted licence). The first part of the script below is Ben's script.
# load libraries
library(FactoMineR)
library(ggplot2)
library(scales)
library(grid)
library(plyr)
library(gridExtra)
#
# start with a clean slate
# rm(list=ls(all=TRUE))
#
# load example data
data(decathlon)
#
# compute PCA
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13, graph = FALSE)
#
# extract some parts for plotting
PC1 <- res.pca$ind$coord[,1]
PC2 <- res.pca$ind$coord[,2]
labs <- rownames(res.pca$ind$coord)
PCs <- data.frame(cbind(PC1,PC2))
rownames(PCs) <- labs
#
# Now, the code to produce Ben's first chart but with less overlap of the labels.
library(maptools)
PCs$label=rownames(PCs)
# Base plot first for pointLabels() to get locations
plot(PCs$PC1, PCs$PC2, pch = 20, col = "red")
new = pointLabel(PCs$PC1, PCs$PC2, PCs$label, cex = .7)
new = as.data.frame(new)
new$label = PCs$label
# Then plot using ggplot2
(p = ggplot(data = PCs) +
geom_hline(yintercept = 0, linetype = 3, colour = "grey20") +
geom_vline(xintercept = 0, linetype = 3, colour = "grey20") +
geom_point(aes(PC1, PC2), shape = 20, col = "red") +
theme_bw())
(p = p + geom_text(data = new, aes(x, y, label = label), size = 3))
The result is:
An alternative is to use the biplot function from CoreR or biplot.psych from the psych package. This will put the components and the data onto the same figure.
For the decathlon data set, use principal and biplot from the psych package:
library(FactoMineR) #needed to get the example data
library(psych) #needed for principal
data(decathlon) #the data set
pc2 <- principal(decathlon[1:10],2) #just the first 10 columns
biplot(pc2,labels = rownames(decathlon),cex=.5, main="Biplot of Decathlon results")
#this is a call to biplot.psych which in turn calls biplot.
#adjust the cex parameter to change the type size of the labels.
This looks like:
!a biplot http://personality-project.org/r/images/olympic.biplot.pdf
Bill

Resources