R Subset of pam, Arrange multiple figures in one - r

I'm struggling with the following problem:
I use pam to cluster my dataset v in 7 clusters:
x <- pam(v,7)
I know that x contains a vector, clustering, which holds the cluster number assigned to each observation.
I would like to get a subset of x which only contains cluster 1.
Is this possible?
Edit:
Here is an example. Cluster iris in three clusters and plot them.
library(ggfortify)
library(cluster)
v <- iris[-5]
x <- pam(v,3)
autoplot(x, frame = TRUE, frame.type = 'norm')
The question: How can I plot only the first cluster? It should look like the first plot without cluster 2 and 3.
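One way to read this request, sketched under the assumption that "subset" means the rows of v that pam assigned to cluster 1: the clustering vector can be used as a logical index. The names cluster1_rows and cluster1_medoid are illustrative and not from the original post.

```r
library(cluster)

v <- iris[-5]
x <- pam(v, 3)

# x$clustering holds one cluster number per row of v
cluster1_rows <- v[x$clustering == 1, ]

# the medoid (representative observation) of cluster 1
cluster1_medoid <- x$medoids[1, ]

nrow(cluster1_rows)
```

The subset is an ordinary data frame, so it can be passed to any plotting function; it is not itself a pam object.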
Edit: I think I found a solution. Instead of using autoplot, I now calculate the convex hull of every cluster and plot it myself.
library(cluster)
library(plyr)
library(ggplot2)
library(ggrepel)
find_hull <- function(df) df[chull(df$x, df$y),]
v<-iris[-5]
pp <- pam(v,3)
n<-princomp(pp$data, scores = TRUE, cor = ncol(pp$data) != 2)$scores
df<-data.frame(n[,1],n[,2],pp$clustering)
colnames(df)<-c("x","y","z")
hulls <- ddply(df, "z", find_hull)
p<-qplot(x,y,data=df,color=as.factor(z))+
geom_polygon(data=hulls, alpha=1, fill=NA)+
geom_text_repel(aes(label = rownames(df)),arrow = arrow(length = unit(0.00, 'inches'), angle = 0.00),size=5.5,colour="grey55")+
theme_classic(base_size = 16)+
theme(axis.line=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),axis.title.y=element_blank(),legend.position="none",
panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),plot.background=element_blank())
p
df2<-df[df$z==1,]
hulls <- ddply(df2, "z", find_hull)
p1<-qplot(x,y,data=df2,color=as.factor(z))+
geom_polygon(data=hulls, alpha=0.8, fill=NA)+
geom_text_repel(aes(label = rownames(df2)),arrow = arrow(length = unit(0.00, 'inches'), angle = 0.00),size=5.5,colour="grey25")+
theme_classic(base_size = 16)+
theme(axis.line=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),axis.title.y=element_blank(),legend.position="none",
panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),plot.background=element_blank())
p1
Now I want to plot both figures in one device. I have already tried the multiplot from cookbook-r but it gives the error
Error: Aesthetics must be either length 1 or the same as the data (26): label, x, y
It must be because of the labels I guess.
I also tried
grid.arrange(p,p1, ncol=1)
from the gridExtra package but it gives the same error.
Is there any other option to arrange multiple figures with labels in one figure?
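A pattern that often avoids this kind of length mismatch is to store the labels as a column of the plotted data, so that every layer evaluates label against its own data frame. The sketch below is not a confirmed diagnosis of the error above; it assumes ggplot2 and gridExtra are installed, uses geom_text in place of geom_text_repel to keep dependencies small, and introduces the hypothetical helpers hull_of and make_plot.

```r
library(cluster)
library(ggplot2)
library(gridExtra)

pp <- pam(iris[-5], 3)
scores <- princomp(iris[-5], cor = TRUE)$scores

# Keep the labels inside the data frame instead of calling rownames() in aes()
df <- data.frame(x = scores[, 1], y = scores[, 2],
                 z = factor(pp$clustering),
                 label = rownames(iris))

# Convex hull rows for each cluster, computed with base R
hull_of <- function(d) d[chull(d$x, d$y), ]
hulls <- do.call(rbind, lapply(split(df, df$z), hull_of))

# Hypothetical helper: same plot for any subset of points and hulls
make_plot <- function(points, hulls) {
  ggplot(points, aes(x, y, colour = z)) +
    geom_point() +
    geom_polygon(data = hulls, fill = NA) +
    geom_text(aes(label = label), colour = "grey55", size = 3, vjust = -0.6) +
    theme_classic() +
    theme(legend.position = "none")
}

p  <- make_plot(df, hulls)
p1 <- make_plot(df[df$z == "1", ], hulls[hulls$z == "1", ])

# Stack both figures on one device
both <- arrangeGrob(p, p1, ncol = 1)
grid::grid.newpage()
grid::grid.draw(both)
```

Because each plot carries its own label column, the two plots can be arranged without either one looking up a data frame of a different length.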

Related

Changing aesthetics in ggplot generated by svars package in R

I'm using the svars package to generate some IRF plots. The plots are rendered using ggplot2, however I need some help with changing some of the aesthetics.
Is there any way I can change the fill and alpha of the shaded confidence bands, as well as the color of the solid line? I know in ggplot2 you can pass fill and alpha arguments to geom_ribbon (and col to geom_line), just unsure of how to do the same within the plot function of this package's source code.
# Load Dataset and packages
library(tidyverse)
library(svars)
data(USA)
# Create SVAR Model
var.model <- vars::VAR(USA, lag.max = 10, ic = "AIC" )
svar.model <- id.chol(var.model)
# Wild Bootstrap
cores <- parallel::detectCores() - 1
boot.svar <- wild.boot(svar.model, n.ahead = 30, nboot = 500, nc = cores)
# Plot the IRFs
plot(boot.svar)
I'm also looking at the command for a historical decomposition plot (see below). Is there any way I could omit the first two facets and plot only the bottom three lines on the same facet?
hist.decomp <- hd(svar.model, series = 1)
plot(hist.decomp)
Your first desired result is easily achieved by resetting the aes_params of the relevant layers after calling plot. For your second goal, there is probably a way to manipulate the ggplot object directly. Instead, my approach below constructs the plot from scratch: I copied the data-wrangling code from svars:::plot.hd and filtered the prepared dataset for the desired series:
# Plot the IRFs
p <- plot(boot.svar)
p$layers[[1]]$aes_params$fill <- "pink"
p$layers[[1]]$aes_params$alpha <- .5
p$layers[[2]]$aes_params$colour <- "green"
p
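The same layer-editing idea can be tried on a small self-contained plot that does not need svars. This is only a sketch: the data frame d and the layer order (ribbon first, line second) are assumptions made for illustration, so inspect p$layers on your own plot before relying on the indices.

```r
library(ggplot2)

# Toy data standing in for an impulse response with a confidence band
d <- data.frame(t = 1:10, y = (1:10) / 10)

p <- ggplot(d, aes(t, y)) +
  geom_ribbon(aes(ymin = y - 0.2, ymax = y + 0.2)) +  # layer 1
  geom_line()                                         # layer 2

# Fixed (non-mapped) aesthetics live in aes_params of each layer
p$layers[[1]]$aes_params$fill   <- "pink"
p$layers[[1]]$aes_params$alpha  <- 0.5
p$layers[[2]]$aes_params$colour <- "green"
p
```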
# Helper to convert to long dataframe. Source: svars:::plot.hd
hd2PlotData <- function(x) {
  PlotData <- as.data.frame(x$hidec)
  if (inherits(x$hidec, "ts")) {
    # time series input: rebuild the time index from the tsp attribute
    tsStructure <- attr(x$hidec, which = "tsp")
    PlotData$Index <- seq(from = tsStructure[1], to = tsStructure[2],
                          by = 1/tsStructure[3])
    # yearmon() comes from the zoo package
    PlotData$Index <- as.Date(zoo::yearmon(PlotData$Index))
  } else {
    PlotData$Index <- 1:nrow(PlotData)
    PlotData$V1 <- NULL
  }
  # long format: one row per (Index, variable) pair
  dat <- reshape2::melt(PlotData, id = "Index")
  dat
}
hist.decomp <- hd(svar.model, series = 1)
dat <- hd2PlotData(hist.decomp)
dat %>%
filter(grepl("^Cum", variable)) %>%
ggplot(aes(x = Index, y = value, color = variable)) +
geom_line() +
xlab("Time") +
theme_bw()
EDIT One approach to change the facet labels is via a custom labeller function. For a different approach which changes the facet labels via the data see here:
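# Replacement labels, one per facet, in panel order
my_labels <- LETTERS[1:9]
# Labeller factory: returns a labeller that ignores the original labels
my_labeller <- function(new_labels) {
  function(labels, multi_line = TRUE) {
    data.frame(variable = new_labels, stringsAsFactors = FALSE)
  }
}
p + facet_wrap(~variable, labeller = my_labeller(my_labels))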

Get multiple polygons for scattered data in R

I have point cloud data of an area (x,y,z coordinates)
The plot of X and Y looks like:
I am trying to get polygons of different clusters in this data. I tried the following:
points <- df [,1:2] # x and y coordinates
pts <- st_as_sf(points, coords=c('X','Y'))
conc <- concaveman(pts, concavity = 0.5, length_threshold = 0)
It seems I only get a single polygon bounding the whole dataset; conc$polygons is a list with one element.
How can I obtain multiple polygons? What am I missing about how concaveman works and what it can provide?
It's hard to tell from your example what variable defines your clusters. Below is an example with some simulated clusters using ggplot2 and data.table (adapted from here).
library(data.table)
library(ggplot2)
# Simulate data:
set.seed(1)
n_cluster = 50
centroids = cbind.data.frame(
x=rnorm(5, mean = 0, sd=5),
y=rnorm(5, mean = 0, sd=5)
)
dt = rbindlist(
lapply(
1:nrow(centroids),
function(i) {
cluster_dt = data.table(
x = rnorm(n_cluster, mean = centroids$x[i]),
y = rnorm(n_cluster, mean = centroids$y[i]),
cluster = i
)
}
)
)
dt[,cluster:=as.factor(cluster)]
# Find convex hull of each point by cluster:
hulls = dt[,.SD[chull(x,y)],by=.(cluster)]
# Plot:
p = ggplot(data = dt, aes(x=x, y=y, colour=cluster)) +
  geom_point() +
  # alpha is a fixed setting, so it belongs outside aes()
  geom_polygon(data = hulls, aes(fill=cluster), alpha = 0.5)
p
This produces the following output:
Edit
If you don't have predefined clusters, you can use a clustering algorithm. As a simple example, see below for a solution using kmeans with 5 centroids.
# Estimate clusters (e.g. kmeans):
dt[,km_cluster := as.factor(kmeans(.SD,5)$cluster),.SDcols=c("x","y")]
# Find convex hull of each point:
hulls = dt[,.SD[chull(x,y)],by=.(km_cluster)]
# Plot:
p = ggplot(data = dt, aes(x=x, y=y, colour=km_cluster)) +
  geom_point() +
  # alpha is a fixed setting, so it belongs outside aes()
  geom_polygon(data = hulls, aes(fill=km_cluster), alpha = 0.5)
p
In this case the output for the estimated clusters is almost equivalent to the constructed ones.

Combine a ggplot2 object with a lattice object in one plot

I would like to combine a ggplot2 with a lattice plot object. Since both packages are based on grid I was wondering whether this is possible? Ideally, I would do everything in ggplot2 but I cannot plot a 3d scatter.
So assume I have the following data:
set.seed(1)
mdat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100),
cluster = factor(sample(5, 100, TRUE)))
First, I want to create a scatterplot matrix in ggplot2:
library(ggplot2)
library(gtools)
library(plyr)
cols <- c("x", "y", "z")
allS <- adply(combinations(3, 2, cols), 1, function(r)
data.frame(cluster = mdat$cluster,
var.x = r[1],
x = mdat[[r[1]]],
var.y = r[2],
y = mdat[[r[2]]]))
sc <- ggplot(allS, aes(x = x, y = y, color = cluster)) + geom_point() +
facet_grid(var.x ~ var.y)
So far so good. Now I want to create a lattice 3d scatterplot with all the variables together:
library(lattice)
sc3d <- cloud(z ~ x + y, data = mdat, groups = cluster)
Now I would like to combine sc and sc3d in one single plot. How can I achieve that? Maybe with the help of grid or gridExtra (pushViewport, arrangeGrob)? Or can I produce a 3d scatterplot in ggplot? Ideally, I would like to see the 3d plot in the empty panel of the ggplot, but I guess that's asking too much, so for starters I would be very happy to learn how to arrange these two plots side by side.
library(gridExtra); library(lattice); library(ggplot2)
grid.arrange(xyplot(1~1), qplot(1,1))
You can replace the empty panel by the lattice grob within the gtable, but it doesn't look very good due to the axes etc.
g <- ggplotGrob(sc)
lg <- gridExtra:::latticeGrob(sc3d)
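# panel grobs are named "panel" in old ggplot2 and "panel-1-1", ... in newer versions
ids <- grep("^panel", g$layout$name)
# index of the empty panel; assumed to be the second one, adjust if needed
remove <- ids[2]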
g$grobs[[remove]] <- lg
grid.newpage()
grid.draw(g)

Plotting envfit vectors (vegan package) in ggplot2

I am working on finalizing an NMDS plot that I created with vegan and ggplot2, but I cannot figure out how to add envfit species-loading vectors to the plot. When I try to, I get the error "invalid graphics state".
The example below is slightly modified from another question (Plotting ordiellipse function from vegan package onto NMDS plot created in ggplot2) but it expressed exactly the example I wanted to include since I used this question to help me get metaMDS into ggplot2 in the first place:
library(vegan)
library(ggplot2)
data(dune)
# calculate distance for NMDS
NMDS.log<-log(dune+1)
sol <- metaMDS(NMDS.log)
# Create meta data for grouping
MyMeta = data.frame(
sites = c(2,13,4,16,6,1,8,5,17,15,10,11,9,18,3,20,14,19,12,7),
amt = c("hi", "hi", "hi", "md", "lo", "hi", "hi", "lo", "md", "md", "lo",
"lo", "hi", "lo", "hi", "md", "md", "lo", "hi", "lo"),
row.names = "sites")
# plot NMDS using basic plot function and color points by "amt" from MyMeta
plot(sol$points, col = MyMeta$amt)
# same in ggplot2
NMDS = data.frame(MDS1 = sol$points[,1], MDS2 = sol$points[,2])
ggplot(data = NMDS, aes(MDS1, MDS2)) +
geom_point(aes(color = MyMeta$amt))
#Add species loadings
vec.sp<-envfit(sol$points, NMDS.log, perm=1000)
plot(vec.sp, p.max=0.1, col="blue")
The problem with the (otherwise excellent) accepted answer, and which explains why the vectors are all of the same length in the included figure, is that what is stored in the $vectors$arrows component of the object returned by envfit() are the direction cosines of the fitted vectors. These are all of unit length, and hence the arrows in @Didzis Elferts' plot are all the same length. [Note that the accepted answer has since been edited to scale the arrows in the manner I describe below, to avoid confusion for users coming across the Q&A.]
This is different from the output of plot(envfit(sol, NMDS.log)), and arises because we scale the vector arrow coordinates by the correlation with the ordination configuration ("axes"). That way, species that show a weak relationship with the ordination configuration get shorter arrows. The scaling is done by multiplying the direction cosines by sqrt(r2), where r2 are the values shown in the table of printed output.
When adding the vectors to an existing plot, vegan also tries to scale the set of vectors such that they fill the available plot space whilst maintaining the relative lengths of the arrows. How this is done is discussed in the Details section of ?envfit and requires the use of the un-exported function vegan:::ordiArrowMul(result_of_envfit).
Here is a full working example that replicates the behaviour of plot.envfit using ggplot2:
library(vegan)
library(ggplot2)
library(grid)
data(dune)
# calculate distance for NMDS
NMDS.log<-log1p(dune)
set.seed(42)
sol <- metaMDS(NMDS.log)
scrs <- as.data.frame(scores(sol, display = "sites"))
scrs <- cbind(scrs, Group = c("hi","hi","hi","md","lo","hi","hi","lo","md","md",
"lo","lo","hi","lo","hi","md","md","lo","hi","lo"))
set.seed(123)
vf <- envfit(sol, NMDS.log, perm = 999)
If we stop at this point and look at vf:
> vf
***VECTORS
NMDS1 NMDS2 r2 Pr(>r)
Belper -0.78061195 -0.62501598 0.1942 0.174
Empnig -0.01315693 0.99991344 0.2501 0.054 .
Junbuf 0.22941001 -0.97332987 0.1397 0.293
Junart 0.99999981 -0.00062172 0.3647 0.022 *
Airpra -0.20995196 0.97771170 0.5376 0.002 **
Elepal 0.98959723 0.14386566 0.6634 0.001 ***
Rumace -0.87985767 -0.47523728 0.0948 0.429
.... <truncated>
So the r2 data is used to scale the values in columns NMDS1 and NMDS2. The final plot is produced with:
spp.scrs <- as.data.frame(scores(vf, display = "vectors"))
spp.scrs <- cbind(spp.scrs, Species = rownames(spp.scrs))
p <- ggplot(scrs) +
geom_point(mapping = aes(x = NMDS1, y = NMDS2, colour = Group)) +
coord_fixed() + ## need aspect ratio of 1!
geom_segment(data = spp.scrs,
aes(x = 0, xend = NMDS1, y = 0, yend = NMDS2),
arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
geom_text(data = spp.scrs, aes(x = NMDS1, y = NMDS2, label = Species),
size = 3)
This produces:
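To make the scaling concrete, the following sketch takes two rows from the printed vf output above and applies the sqrt(r2) multiplication by hand. It uses only base R; the object names arrows, r2 and scaled are illustrative, and the numbers are copied from the table rather than recomputed.

```r
# Direction cosines (NMDS1, NMDS2) copied from the printed output of vf
arrows <- rbind(
  Belper = c(NMDS1 = -0.78061195, NMDS2 = -0.62501598),
  Elepal = c(NMDS1 =  0.98959723, NMDS2 =  0.14386566)
)
r2 <- c(Belper = 0.1942, Elepal = 0.6634)

# Each row is a unit vector, so all unscaled arrows have the same length
unit_lengths <- sqrt(rowSums(arrows^2))

# Multiply each row by sqrt(r2): weakly related species get shorter arrows
scaled <- arrows * sqrt(r2)
scaled_lengths <- sqrt(rowSums(scaled^2))

unit_lengths
scaled_lengths
```

After scaling, the length of each arrow equals sqrt(r2) for that species, which is why Elepal ends up with a visibly longer arrow than Belper.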
Start by loading the libraries. The grid package is also needed.
library(ggplot2)
library(vegan)
library(grid)
data(dune)
Run the metaMDS analysis and save the results in a data frame.
NMDS.log<-log(dune+1)
sol <- metaMDS(NMDS.log)
NMDS = data.frame(MDS1 = sol$points[,1], MDS2 = sol$points[,2])
Add the species loadings and save them as a data frame. The direction cosines of the arrows are stored in the matrix arrows inside the list vectors. To get the coordinates of the arrows, those direction values should be multiplied by the square root of the r2 values that are stored in vectors$r. A more straightforward way is to use the function scores(), as shown in the answer of @Gavin Simpson. Then add a new column containing the species names.
vec.sp<-envfit(sol$points, NMDS.log, perm=1000)
vec.sp.df<-as.data.frame(vec.sp$vectors$arrows*sqrt(vec.sp$vectors$r))
vec.sp.df$species<-rownames(vec.sp.df)
Arrows are added with geom_segment() and species names with geom_text(). For both tasks data frame vec.sp.df is used.
# MyMeta is the grouping data frame defined in the question
ggplot(data = NMDS, aes(MDS1, MDS2)) +
  geom_point(aes(color = MyMeta$amt)) +
  geom_segment(data = vec.sp.df, aes(x = 0, xend = MDS1, y = 0, yend = MDS2),
               arrow = arrow(length = unit(0.5, "cm")), colour = "grey", inherit.aes = FALSE) +
  geom_text(data = vec.sp.df, aes(x = MDS1, y = MDS2, label = species), size = 5) +
  coord_fixed()
May I add something late?
envfit provides p-values, and sometimes you want to plot only the significant parameters (something vegan can do for you with p.max = 0.05 in the plot command). I struggled to do that with ggplot2. Here is my solution; maybe you can find a more elegant one.
Starting from Didzis' answer from above:
ef<-envfit(sol$points, NMDS.log, perm=1000)
ef.df<-as.data.frame(ef$vectors$arrows*sqrt(ef$vectors$r))
ef.df$species<-rownames(ef.df)
#only significant pvalues
#shortcutting ef$vectors
A <- as.list(ef$vectors)
#creating the dataframe
pvals<-as.data.frame(A$pvals)
arrows<-as.data.frame(A$arrows*sqrt(A$r))
C<-cbind(arrows, pvals)
#subset
Cred<-subset(C,pvals<0.05)
Cred <- cbind(Cred, Species = rownames(Cred))
Cred can now be used as the data argument of geom_segment, as discussed above.
Short addition: To get a full representation of the plot.envfit functionality within ggplot2, i.e. "arrow lengths make full use of plot area", a scaling factor needs to be applied. I don't know if it was intentionally left out in the answers above, as Gavin even mentioned it specifically. Just extract the required scaling factor using arrow_factor <- ordiArrowMul(vf); you can then either apply it to both NMDS columns in spp.scrs or multiply inside aes(), like so:
arrow_factor <- ordiArrowMul(vf)
# option 1: apply the arrow factor to both NMDS columns here
spp.scrs <- as.data.frame(scores(vf, display = "vectors")) * arrow_factor
spp.scrs <- cbind(spp.scrs, Species = rownames(spp.scrs), Pvalues = vf$vectors$pvals, R_squared = vf$vectors$r)
# select significance similarly to `plot(vf, p.max = 0.01)`
spp.scrs <- subset(spp.scrs, Pvalues < 0.01)
# option 2: skip the multiplication above and use NMDS1 * arrow_factor and
# NMDS2 * arrow_factor inside aes() below instead (don't do both!)
ggplot(scrs) +
  geom_point(mapping = aes(x = NMDS1, y = NMDS2, colour = Group)) +
  coord_fixed() + ## need aspect ratio of 1!
  geom_segment(data = spp.scrs,
               aes(x = 0, xend = NMDS1, y = 0, yend = NMDS2),
               arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
  geom_text(data = spp.scrs, aes(x = NMDS1, y = NMDS2, label = Species),
            size = 3)

PCA FactoMineR plot data

I'm running an R script that generates plots of a PCA using FactoMineR.
I'd like to output the coordinates for the generated PCA plots, but I'm having trouble finding the right ones. I found result1$ind$coord and result1$var$coord, but neither looks like the default plot.
I found
http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf
and
http://factominer.free.fr/classical-methods/principal-components-analysis.html
but neither describes the contents of the object returned by PCA.
library(FactoMineR)
args <- commandArgs(trailingOnly = TRUE) # input file path is the first command-line argument
data1 <- read.table(file=args[1], sep='\t', header=TRUE, row.names=1)
result1 <- PCA(data1,ncp = 4, graph=TRUE) # graphs generated automatically
plot(result1)
I found that $ind$coord[,1] and $ind$coord[,2] hold the coordinates of the individuals on the first two principal components in the PCA object. Here's a worked example that includes a few other things you might want to do with the PCA output...
# Plotting the output of FactoMineR's PCA using ggplot2
#
# load libraries
library(FactoMineR)
library(ggplot2)
library(scales)
library(grid)
library(plyr)
library(gridExtra)
#
# start with a clean slate
rm(list=ls(all=TRUE))
#
# load example data
data(decathlon)
#
# compute PCA
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13, graph = FALSE)
#
# extract some parts for plotting
PC1 <- res.pca$ind$coord[,1]
PC2 <- res.pca$ind$coord[,2]
labs <- rownames(res.pca$ind$coord)
PCs <- data.frame(cbind(PC1,PC2))
rownames(PCs) <- labs
#
# Just showing the individual samples...
ggplot(PCs, aes(PC1,PC2, label=rownames(PCs))) +
geom_text()
# Now get supplementary categorical variables
cPC1 <- res.pca$quali.sup$coord[,1]
cPC2 <- res.pca$quali.sup$coord[,2]
clabs <- rownames(res.pca$quali.sup$coord)
cPCs <- data.frame(cbind(cPC1,cPC2))
rownames(cPCs) <- clabs
colnames(cPCs) <- colnames(PCs)
#
# Put samples and categorical variables (ie. grouping
# of samples) all together
p <- ggplot() + theme_bw(base_size = 20) + theme(aspect.ratio=1) # set the aspect ratio after theme_bw(), which would otherwise reset it
# no data so there's nothing to plot...
# add on data
p <- p + geom_text(data=PCs, aes(x=PC1,y=PC2,label=rownames(PCs)), size=4)
p <- p + geom_text(data=cPCs, aes(x=cPC1,y=cPC2,label=rownames(cPCs)),size=10)
p # show plot with both layers
# Now extract the variables
#
vPC1 <- res.pca$var$coord[,1]
vPC2 <- res.pca$var$coord[,2]
vlabs <- rownames(res.pca$var$coord)
vPCs <- data.frame(cbind(vPC1,vPC2))
rownames(vPCs) <- vlabs
colnames(vPCs) <- colnames(PCs)
#
# and plot them
#
pv <- ggplot() + theme_bw(base_size = 20) + theme(aspect.ratio=1) # set the aspect ratio after theme_bw(), which would otherwise reset it
# no data so there's nothing to plot
# put a faint circle there, as is customary
angle <- seq(-pi, pi, length = 50)
df <- data.frame(x = sin(angle), y = cos(angle))
pv <- pv + geom_path(aes(x, y), data = df, colour="grey70")
#
# add on arrows and variable labels
pv <- pv + geom_text(data=vPCs, aes(x=vPC1,y=vPC2,label=rownames(vPCs)), size=4) + xlab("PC1") + ylab("PC2")
pv <- pv + geom_segment(data=vPCs, aes(x = 0, y = 0, xend = vPC1*0.9, yend = vPC2*0.9), arrow = arrow(length = unit(1/2, 'picas')), color = "grey30")
pv # show plot
# Now put them side by side in a single image
#
grid.arrange(p,pv,nrow=1)
#
# Now they can be saved or exported...
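Since the question asks how to output the coordinates, here is a minimal sketch of writing them to disk. It assumes the same decathlon example as above; the names ind_coords, var_coords and out_file and the choice of CSV in a temporary directory are illustrative only.

```r
library(FactoMineR)

data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

# Coordinates of the individuals (rows) and of the variables on each dimension
ind_coords <- as.data.frame(res.pca$ind$coord)
var_coords <- as.data.frame(res.pca$var$coord)

# Write the individual coordinates to a CSV file; row names are kept as labels
out_file <- file.path(tempdir(), "pca_individuals.csv")
write.csv(ind_coords, out_file)

head(ind_coords[, 1:2])
```

The first two columns of ind_coords are the values plotted as PC1 and PC2 in the individuals plot above.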
Adding something extra to Ben's answer. You'll note in the first chart in Ben's response that the labels overlap somewhat. The pointLabel() function in the maptools package attempts to find locations for the labels without overlap. It's not perfect, but you can adjust the positions in the new dataframe (see below) to fine tune if you want. (Also, when you load maptools you get a note about gpclibPermit(). You can ignore it if you're concerned about the restricted licence). The first part of the script below is Ben's script.
# load libraries
library(FactoMineR)
library(ggplot2)
library(scales)
library(grid)
library(plyr)
library(gridExtra)
#
# start with a clean slate
# rm(list=ls(all=TRUE))
#
# load example data
data(decathlon)
#
# compute PCA
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13, graph = FALSE)
#
# extract some parts for plotting
PC1 <- res.pca$ind$coord[,1]
PC2 <- res.pca$ind$coord[,2]
labs <- rownames(res.pca$ind$coord)
PCs <- data.frame(cbind(PC1,PC2))
rownames(PCs) <- labs
#
# Now, the code to produce Ben's first chart but with less overlap of the labels.
library(maptools)
PCs$label=rownames(PCs)
# Base plot first for pointLabels() to get locations
plot(PCs$PC1, PCs$PC2, pch = 20, col = "red")
new = pointLabel(PCs$PC1, PCs$PC2, PCs$label, cex = .7)
new = as.data.frame(new)
new$label = PCs$label
# Then plot using ggplot2
(p = ggplot(data = PCs) +
geom_hline(yintercept = 0, linetype = 3, colour = "grey20") +
geom_vline(xintercept = 0, linetype = 3, colour = "grey20") +
geom_point(aes(PC1, PC2), shape = 20, col = "red") +
theme_bw())
(p = p + geom_text(data = new, aes(x, y, label = label), size = 3))
The result is:
An alternative is to use the biplot function from core R or biplot.psych from the psych package. This will put the components and the data onto the same figure.
For the decathlon data set, use principal and biplot from the psych package:
library(FactoMineR) #needed to get the example data
library(psych) #needed for principal
data(decathlon) #the data set
pc2 <- principal(decathlon[1:10],2) #just the first 10 columns
biplot(pc2,labels = rownames(decathlon),cex=.5, main="Biplot of Decathlon results")
#this is a call to biplot.psych which in turn calls biplot.
#adjust the cex parameter to change the type size of the labels.
This looks like:
A biplot: http://personality-project.org/r/images/olympic.biplot.pdf
Bill
