How to cluster within clusters


I have a set of points on a map, each with a given parameter value. I would like to:
1. Cluster them spatially and ignore any clusters having fewer than 10 points. My df should have a column (Clust) for the cluster each point belongs to [DONE]
2. Sub-cluster the parameter values within each cluster; add a column to my df (subClust) used to categorize each point by sub-cluster.
I don't know how to do the second part, except maybe with loops.
The image shows the set of spatially distributed points (top left) colour coded by cluster and sorted by parameter value in the top right plot. The bottom row shows clusters with >10 points (left) and facets for each cluster sorted by parameter value (right). It's these facets that I'd like to be able to colour code by sub-cluster according to a minimum cluster separation distance (d=1)
Any pointers/help appreciated. My reproducible code is below.
# TESTING
library(tidyverse)
library(gridExtra)
# Create a random (X, Y, Value) dataset
set.seed(36)
x_ex <- round(rnorm(200,50,20))
y_ex <- round(runif(200,0,85))
values <- rexp(200, 0.2)
df_ex <- data.frame(ID=1:length(y_ex),x=x_ex,y=y_ex,Test_Param=values)
# Cluster data by (X,Y) location
d = 4
chc <- hclust(dist(df_ex[,2:3]), method="single")
# Distance with a d threshold - used d=40 at one time but that changes...
chc.d40 <- cutree(chc, h=d)
# max(chc.d40)
# Join results
xy_df <- data.frame(df_ex, Clust=chc.d40)
# Plot results
breaks = max(chc.d40)
xy_df_filt <- xy_df %>% dplyr::group_by(Clust) %>% dplyr::mutate(n=n()) %>% dplyr::filter(n>10)# %>% nrow
p1 <- ggplot() +
geom_point(data=xy_df, aes(x=x, y=y, colour = Clust)) +
scale_color_gradientn(colours = rainbow(breaks)) +
xlim(0,100) + ylim(0,100)
p2 <- xy_df %>% dplyr::arrange(Test_Param) %>%
ggplot() +
geom_point(aes(x=1:length(Test_Param),y=Test_Param, colour = Test_Param)) +
scale_colour_gradient(low="red", high="green")
p3 <- ggplot() +
geom_point(data=xy_df_filt, aes(x=x, y=y, colour = Clust)) +
scale_color_gradientn(colours = rainbow(breaks)) +
xlim(0,100) + ylim(0,100)
p4 <- xy_df_filt %>% dplyr::arrange(Test_Param) %>%
ggplot() +
geom_point(aes(x=1:length(Test_Param),y=Test_Param, colour = Test_Param)) +
scale_colour_gradient(low="red", high="green") +
facet_wrap(~Clust, scales="free")
grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)
THIS SNIPPET DOES NOT WORK - can't pipe within dplyr mutate() ...
# Second Hierarchical Clustering: Try to sub-cluster by Test_Param within the individual clusters I've already defined above
xy_df_filt %>% # This part does not work
dplyr::group_by(Clust) %>%
dplyr::mutate(subClust = hclust(dist(.$Test_Param), method="single") %>%
cutree(, h=1))
Below is a way around it using a loop - but I'd really rather learn how to do this using dplyr or some other non-loop method. An updated image showing the sub-clustered facets follows.
sub_df <- data.frame()
for (i in unique(xy_df_filt$Clust)) {
temp_df <- xy_df_filt %>% dplyr::filter(Clust == i)
# Sub-cluster this cluster's points by Test_Param value
a_d = 1
a_chc <- hclust(dist(temp_df$Test_Param), method="single")
# Distance with a d threshold - used d=40 at one time but that changes...
a_chc.d40 <- cutree(a_chc, h=a_d)
# max(chc.d40)
# Join results to main df
sub_df <- bind_rows(sub_df, data.frame(temp_df, subClust=a_chc.d40)) %>% dplyr::select(ID, subClust)
}
xy_df_filt_2 <- left_join(xy_df_filt,sub_df, by=c("ID"="ID"))
p4 <- xy_df_filt_2 %>% dplyr::arrange(Test_Param) %>%
ggplot() +
geom_point(aes(x=1:length(Test_Param),y=Test_Param, colour = subClust)) +
scale_colour_gradient(low="red", high="green") +
facet_wrap(~Clust, scales="free")
grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)

There should be a way to do it using a combination of do and tidy, but I always have a hard time getting things to line up the way I want using do. Instead, what I usually do is combine split from base R and map_dfr from purrr. split will split the dataframe by Clust and give you a list of dataframes that you can then map over. map_dfr maps over each of those dataframes and returns a single dataframe.
I started from your xy_df_filt and generated what I believe should be the same as the xy_df_filt_2 that you got from the for loop. I made two plots, although the two sets of clusters are a little hard to see.
xy_df_filt_2 <- xy_df_filt %>%
split(.$Clust) %>%
map_dfr(function(df) {
subClust <- hclust(dist(df$Test_Param), method = "single") %>% cutree(., h = 1)
bind_cols(df, subClust = subClust)
})
ggplot(xy_df_filt_2, aes(x = x, y = y, color = as.factor(subClust), shape = as.factor(Clust))) +
geom_point() +
scale_color_brewer(palette = "Set2")
Clearer with faceting
ggplot(xy_df_filt_2, aes(x = x, y = y, color = as.factor(subClust), shape = as.factor(Clust))) +
geom_point() +
scale_color_brewer(palette = "Set2") +
facet_wrap(~ Clust)
Created on 2018-04-14 by the reprex package (v0.2.0).

You could do this for your subclusters...
xy_df_filt_2 <- xy_df_filt %>%
group_by(Clust) %>%
mutate(subClust = tibble(Test_Param) %>%
dist() %>%
hclust(method="single") %>%
cutree(h=1))
Nested pipes are fine. I think the problem with your version was that you were not passing the right sort of object to dist.
The tibble term is not needed if you are only passing a single column to dist, but I have left it in in case you want to use several columns as you do for the main clustering.
You could use the same sort of formula, but without the group_by, to calculate xy_df from df_ex.
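As a rough sketch of that last suggestion, the spatial clustering step could be written as a single mutate() call without group_by. The data below reproduces the question's df_ex and its threshold d = 4; the use of tibble(x, y) to pass both coordinate columns to dist() is my reading of the answer, not something the answerer spelled out.

```r
library(dplyr)

# Rebuild the question's example data
set.seed(36)
x_ex <- round(rnorm(200, 50, 20))
y_ex <- round(runif(200, 0, 85))
values <- rexp(200, 0.2)
df_ex <- data.frame(ID = seq_along(y_ex), x = x_ex, y = y_ex, Test_Param = values)

# Spatial clustering in one pipeline: both coordinate columns go to dist()
xy_df <- df_ex %>%
  mutate(Clust = tibble(x, y) %>%
           dist() %>%
           hclust(method = "single") %>%
           cutree(h = 4))

# Number of points per spatial cluster
head(count(xy_df, Clust))
```

The grouped version for the sub-clusters then only differs by the group_by(Clust) step and the column handed to dist().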

Related

ggforce: geom_mark_ellipse - How to move connectors?

I have some data for which I would like to circle some different subsets. I am using ggplot2 and ggforce to plot the data and draw an ellipse (geom_mark_ellipse) around the data.
I have an issue in that the positions of the connectors on the ellipses (for my data) are in ambiguous positions (at the conjunction of two ellipses, on the border of two ellipses that graze each other).
How can I manually set the position of the connector to the ellipse? Or at least influence them into a particular region?
I have some code below which captures the spirit in which I'm plotting my data. For the purpose of the example, how could I make all of the labels appear in the top left of the plot, or all join the ellipses at x == 0, -2, -4 for each of the factors?
library(tidyverse)
library(ggforce)
x <- c(-1,0,1,-3,-2,2,3,-5,-4,4,5)
t <- c(1,1,1,2,2,2,2,3,3,3,3)
tmp <- as_tibble_col(x, column_name = "x")
tmp <- tmp %>% mutate(t = t)
#How do I move the position of the label connectors on the ellipses?
tmp %>%
ggplot(aes(x=x, y=x)) +
geom_mark_ellipse(aes(label = t, group=t),con.cap = 0) +
geom_point()
Created on 2020-05-05 by the reprex package (v0.3.0)

I've managed to do it for my contrived example, yet to try on my real data, but there is hope.
As shown in the code below, I created data to fill the area (top left) that I didn't want to have labels in, and gave it a factor of "". I manually set the colour of the connectors to NA for that factor, and got rid of the label background for everything. Because the factor is "", the label is an empty string, and nothing shows up. I also set scale_colour_manual to give the colour NA to the ellipse I didn't want to see.
I also filtered the geom_point to not show the data with a factor of "". Finally, I deleted the legend.
library(tidyverse)
library(ggforce)
x <- c(-1,0,1,-3,-2,2,3,-5,-4,4,5)
t <- c(1,1,1,2,2,2,2,3,3,3,3)
tmp <- as_tibble_col(x, column_name = "x")
tmp <- tmp %>% mutate(y=x)
tmp <- tmp %>% mutate(t = t)
#now lets add some dodging data
tmp <- tmp %>% mutate(t = as.character(t))
tmp <- tmp %>% add_row(x=c(-5,2.5,-2.5), y=c(-2.5,5,2.5),t="")
tmp %>%
ggplot(aes(x=x, y=y)) +
geom_mark_ellipse(aes(label = t, group=t, colour=factor(t)),
con.cap = 0, con.colour = c(NA, "black","black","black"),
label.fill=NA) +
scale_colour_manual(values=c(NA, "black", "black", "black")) +
geom_point(data = subset(tmp, t != "")) +
theme(legend.position = "none")
Created on 2020-05-06 by the reprex package (v0.3.0)

how to loop a geographic mapping function over a list of dataframes (or a subsetted dataframe)

I have a dataframe consisting of species names, longitude and latitude coordinates. There are 115 different species with 25000 lat/long coordinates. I need to make individual maps that show observations for each specific species.
First, I created a function that would generate the kind of map that I want, called platmaps. When I call the function for my full dataset (platmaps(df1)), it creates a map displaying all lat/long observations.
Then I constructed a for loop which was supposed to subset my df by species name and pass that subsetted dataframe to my platmaps function. It runs for a couple of minutes and then nothing happens.
So I then split the dataframe by species name, created a list of dataframes (out1), and used lapply(out1, platmaps), but it only returned a list of the names of my dfs.
Then I tried a variation of an example that I saw here, but it also did not work.
function
platmaps<-function(df1){
wm <- borders("world", colour="gray50", fill="gray50")
ggplot()+
coord_fixed()+
wm +
geom_point(data =df1 , aes(x = decimalLongitude, y = decimalLatitude),
colour = "pink", size = 0.5)
}
subset
for(i in 1:nrow(PP)){
query<-paste(PP$species[i])
p<-subset(df1, df1$species== query)
platmaps(p)
}
list
for (i in 1:length(out1)){
pp<-out1[[i]]
platmaps(pp)
}
applied example
p =
wm <- wm <- borders("world", colour="gray50", fill="gray50")
ggplot()+
coord_fixed()+
wm +
geom_point(data =df1 , aes(x = decimalLongitude, y = decimalLatitude),
colour = "pink", size = 0.5)
plots = df1 %>%
group_by(species) %>%
do(plots = p %+% . + facet_wrap(~species))
the error for the applied example is:
Error: Cannot add ggproto objects together. Did you forget to add this
object to a ggplot object?
As I'm new to R (and coding), I assume I'm getting the syntax wrong, or am not applying my function correctly to/within either of my loops, or I fundamentally misunderstand the way looping works.
data frame sample
species decimalLongitude decimalLatitude
Platanthera lacera -71.90000 42.80000
Platanthera lacera -90.54861 40.12083
Platanthera lacera -71.00889 42.15500
Platanthera lacera -93.20833 45.20028
Platanthera lacera -72.45833 41.91666
Platanthera bifolia 5.19800 59.64310
Platanthera sparsiflora -117.67472 34.36278
fixed platmaps function
ggplot(data=df1 %>% filter(species == s))+
coord_fixed()+
borders("world", colour="gray50", fill="gray50")+
geom_point(aes(x = decimalLongitude, y = decimalLatitude),
colour = "pink", size = 0.5)+
labs(title=as.character(s))

Because you didn't provide a test data set, let me give you a general idea how to make multiple plots you can inspect later. The code below will plot a parameter for a number of countries and save plot pdfs to a given path. You can replace the code behind the pl variable in the loop with your function.
library(ggplot2)
library(dplyr)
df <- data.frame(country = c(rep('USA',20), rep('Canada',20), rep('Mexico',20)),
wave = c(1:20, 1:20, 1:20),
par = c(1:20 + 5*runif(20), 21:40 + 10*runif(20), 1:20 + 15*runif(20)))
countries <- unique(df$country)
plot_list <- list()
i <- 1
for (c in countries){
pl <- ggplot(data = df %>% filter(country == c)) +
geom_point(aes(wave, par), size = 3, color = 'red') +
labs(title = as.character(c), x = 'wave', y = 'value') +
theme_bw(base_size = 16)
plot_list[[i]] <- pl
i <- i + 1
}
# Set the page size when opening the device; pdf.options() only affects devices opened afterwards
pdf('path/to/pdf', width = 9, height = 7)
for (i in 1:length(plot_list)){
print(plot_list[[i]])
}
dev.off()
After the plots are collected in plot_list, we open the PDF device and print them. Finally, we close the device with dev.off().

There is a neat way to apply any function to a list of items. I have outlined a way to do this with the data you added. I could not get platmaps to work, so I have just made a scatter plot.
The method is to split your data frame into individual subsets using split() and then apply the plotting function to the resulting list using lapply(). Since lapply() returns a list, this can be passed directly to a function such as ggpubr::ggarrange() for visualizing.
library(ggplot2)
library(dplyr) # provides the %>% pipe used below
plot_function <- function(x){
p <- ggplot(x, aes(x = decimalLongitude, y = decimalLatitude)) + geom_point()
p
}
plot_list <-
df %>%
split(.$species) %>% # Separate df into subset dfs based on species column
lapply(., plot_function) # map plot_function to list
# Display on a grid (many ways to do this - I just find this package simple)
ggpubr::ggarrange(plotlist = plot_list)
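With 115 species a single grid may get crowded, so here is a minimal sketch of writing one file per species instead. It also illustrates why the original for loop appeared to do nothing: a ggplot object is only drawn when it is printed (or saved), and inside a loop or function that does not happen automatically. The data frame repeats a few rows from the question's sample; plot_one, out_dir and the file naming are hypothetical choices, and the world-map layer is left out to keep the example light.

```r
library(ggplot2)

# A few rows taken from the question's sample data
df1 <- data.frame(
  species = c("Platanthera lacera", "Platanthera lacera",
              "Platanthera bifolia", "Platanthera sparsiflora"),
  decimalLongitude = c(-71.90000, -90.54861, 5.19800, -117.67472),
  decimalLatitude = c(42.80000, 40.12083, 59.64310, 34.36278)
)

# Hypothetical stand-in for platmaps(): one scatter plot per species subset
plot_one <- function(d) {
  ggplot(d, aes(x = decimalLongitude, y = decimalLatitude)) +
    geom_point(colour = "pink", size = 0.5) +
    coord_fixed() +
    labs(title = as.character(d$species[1]))
}

# One plot per species, named after the species
plot_list <- lapply(split(df1, df1$species), plot_one)

# Save each plot explicitly; out_dir is an assumed output location
out_dir <- tempdir()
files <- file.path(out_dir, paste0(gsub(" ", "_", names(plot_list)), ".pdf"))
for (i in seq_along(plot_list)) {
  ggsave(files[i], plot = plot_list[[i]], width = 6, height = 4)
}
```

To view the plots interactively instead of saving them, replace the ggsave() call with print(plot_list[[i]]).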

How to graph "before and after" measures using ggplot with connecting lines and subsets?

I'm totally new to ggplot, relatively fresh with R, and want to make a smashing "before-and-after" scatterplot with connecting lines to illustrate the movement in percentages of different subgroups before and after a special training initiative. I've tried some options, but have yet to:
show each individual observation separately (now same values are overlapping)
connect the related before and after measures (x=0 and X=1) with lines to more clearly illustrate the direction of variation
subset the data along class and id using shape and colors
How can I best create a scatter plot using ggplot (or other) fulfilling the above demands?
Main alternative: geom_point()
Here is some sample data and example code using geom_point
x <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1) # 0=before, 1=after
y <- c(45,30,10,40,10,NA,30,80,80,NA,95,NA,90,NA,90,70,10,80,98,95) # percentage of "feelings of peace"
class <- c(0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1) # 0=multiple days 1=one day
id <- c(1,1,2,3,4,4,4,4,5,6,1,1,2,3,4,4,4,4,5,6) # id = per individual
df <- data.frame(x,y,class,id)
ggplot(df, aes(x=x, y=y), fill=id, shape=class) + geom_point()
Alternative: scale_size()
I have explored stat_sum() to summarize the frequencies of overlapping observations, but then I cannot subset using colors and shapes because of the overlap.
ggplot(df, aes(x=x, y=y)) +
stat_sum()
Alternative: geom_dotplot()
I have also explored geom_dotplot() to clarify the overlapping observations that arise from using geom_point(), as in the example below. However, I have yet to understand how to combine the before and after measures into the same plot.
df1 <- df[1:10,] # data before
df2 <- df[11:20,] # data after
p1 <- ggplot(df1, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
p2 <- ggplot(df2, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
grid.arrange(p1,p2, nrow=1) # gridExtra package

Or maybe it is better to summarize data by x, id, class as mean/median of y, filter out ids producing NAs (e.g. ids 3 and 6), and connect the points by lines? So in case if you don't really need to show variability for some ids (which could be true if the plot only illustrates tendencies) you can do it this way:
library(ggplot2)
library(dplyr)
#library(ggthemes)
df <- df %>%
group_by(x, id, class) %>%
summarize(y = median(y, na.rm = T)) %>%
ungroup() %>%
mutate(
id = factor(id),
x = factor(x, labels = c("before", "after")),
class = factor(class, labels = c("multiple days", "one day")) # 0 = multiple days, 1 = one day
) %>%
group_by(id) %>%
mutate(nas = any(is.na(y))) %>%
ungroup() %>%
filter(!nas) %>%
select(-nas)
ggplot(df, aes(x = x, y = y, col = id, group = id)) +
geom_point(aes(shape = class)) +
geom_line(show.legend = F) +
#theme_few() +
#theme(legend.position = "none") +
ylab("Feelings of peace, %") +
xlab("")

Here's one possible solution for you.
First - to get the color and shapes determined by variables, you need to put these into the aes function. I turned several into factors, so the labs function fixes the labels so they don't appear as "factor(x)" but just "x".
To address multiple points, one solution is to use geom_smooth with method = "lm". This plots the regression line, instead of connecting all the dots.
The option se = FALSE prevents confidence intervals from being plotted - I don't think they add a lot to your plot, but play with it.
Connecting the dots is done by geom_line - feel free to try that as well.
Within geom_point, the option position = position_jitter(width = .1) adds random noise to the x-axis so points do not overlap.
ggplot(df, aes(x=factor(x), y=y, color=factor(id), shape=factor(class), group = id)) +
geom_point(position = position_jitter(width = .1)) +
geom_smooth(method = 'lm', se = FALSE) +
labs(
x = "x",
color = "ID",
shape = 'Class'
)

R Highlight point on ecdf line graph

I'm creating a frequency plot using ggplot and the stat_ecdf function. I would like to add the Y-value to the graph for specific X-values, but just can't figure out how. geom_point or geom_text seems likely options, but as stat_ecdf automatically calculates Y, I don't know how to call that value in the geom_point/text mappings.
Sample code for my initial plot is:
x = as.data.frame(rnorm(100))
ggplot(x, aes(x)) +
stat_ecdf()
Now how would I add specific y-x points here, e.g. y-value at x = -1.

The easiest way is to create the ecdf function beforehand using ecdf() from the stats package, then plot it using geom_label().
library(ggplot2)
# create a data.frame with column name
x = data.frame(col1 = rnorm(100))
# create ecdf function
e = ecdf(x$col1)
# plot the result
ggplot(x, aes(col1)) +
stat_ecdf() +
geom_label(aes(x = -1, y = e(-1)),
label = e(-1))
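One caveat with the code above: because the constants sit inside aes(), geom_label() draws one label per row of the data, all stacked on the same spot. A small variation, offered as a sketch rather than the answerer's method, uses annotate() to draw a single point and label. The seed, the x0 variable and the rounding are my own additions.

```r
library(ggplot2)

set.seed(1)
x <- data.frame(col1 = rnorm(100))

# Empirical CDF of the sample, evaluated at the x value of interest
e <- ecdf(x$col1)
x0 <- -1

p <- ggplot(x, aes(col1)) +
  stat_ecdf() +
  # annotate() draws exactly one mark, independent of the data rows
  annotate("point", x = x0, y = e(x0), colour = "red") +
  annotate("label", x = x0, y = e(x0), label = round(e(x0), 2), vjust = -0.5)
p
```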

You can try
library(tidyverse)
# data
set.seed(123)
df = data.frame(x=rnorm(100))
# Plot
Values <- c(-1,0.5,2)
df %>%
mutate(gr=FALSE) %>%
bind_rows(data.frame(x=Values,gr=TRUE)) %>%
mutate(y=ecdf(x)(x)) %>%
mutate(xmin=min(x)) %>%
ggplot(aes(x, y)) +
stat_ecdf() +
geom_point(data=. %>% filter(gr), aes(x, y)) +
geom_segment(data=. %>% filter(gr),aes(y=y,x=xmin, xend=x,yend=y), color="red")+
geom_segment(data=. %>% filter(gr),aes(y=0,x=x, xend=x,yend=y), color="red") +
ggrepel::geom_label_repel(data=. %>% filter(gr),
aes(x, y, label=paste("x=",round(x,2),"\ny=",round(y,2))))
The idea is to add the y values at the beginning, together with the index gr specifying which Values you want to show.
Edit:
Since this code adds points to the actual data, which could distort the curve, consider excluding these points from at least the ECDF layer: stat_ecdf(data=. %>% filter(!gr))
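Following that caveat, one way to keep the highlighted values out of the curve entirely is to compute their y values from an ecdf() of the original data and hold them in a separate data frame. This is a sketch under that assumption; the name marks is hypothetical, and the label layer from the answer is omitted for brevity.

```r
library(dplyr)
library(ggplot2)

set.seed(123)
df <- data.frame(x = rnorm(100))

# ECDF built from the original sample only
e <- ecdf(df$x)

# Values to highlight, with y read off the unmodified ECDF
marks <- data.frame(x = c(-1, 0.5, 2)) %>%
  mutate(y = e(x), xmin = min(df$x))

p <- ggplot(df, aes(x)) +
  stat_ecdf() +
  geom_point(data = marks, aes(x, y)) +
  # horizontal and vertical guide lines to each highlighted point
  geom_segment(data = marks, aes(x = xmin, xend = x, y = y, yend = y), colour = "red") +
  geom_segment(data = marks, aes(x = x, xend = x, y = 0, yend = y), colour = "red")
p
```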

ggplot2: Different vlines for each graph using facet_wrap [duplicate]

I've poked around, but been unable to find an answer. I want to do a weighted geom_bar plot overlaid with a vertical line that shows the overall weighted average per facet. I'm unable to make this happen. The vertical line seems to be a single value applied to all facets.
require('ggplot2')
require('plyr')
# data vectors
panel <- c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
instrument <-c("V1","V2","V1","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1")
cost <- c(1,4,1.5,1,4,4,1,2,1.5,1,2,1.5,2,1.5,1,2)
sensitivity <- c(3,5,2,5,5,1,1,2,3,4,3,2,1,3,1,2)
# put an initial data frame together
mydata <- data.frame(panel, instrument, cost, sensitivity)
# add a "contribution to" vector to the data frame: contribution of each instrument
# to the panel's weighted average sensitivity.
myfunc <- function(cost, sensitivity) {
return(cost*sensitivity/sum(cost))
}
mydata <- ddply(mydata, .(panel), transform, contrib=myfunc(cost, sensitivity))
# two views of each panels weighted average; should be the same numbers either way
ddply(mydata, c("panel"), summarize, wavg=weighted.mean(sensitivity, cost))
ddply(mydata, c("panel"), summarize, wavg2=sum(contrib))
# plot where each panel is getting its overall cost-weighted sensitivity from. Also
# put each panel's weighted average on the plot as a simple vertical line.
#
# PROBLEM! I don't know how to get geom_vline to honor the facet breakdown. It
# seems to be computing it overall the data and showing the resulting
# value identically in each facet plot.
ggplot(mydata, aes(x=sensitivity, weight=contrib)) +
geom_bar(binwidth=1) +
geom_vline(xintercept=sum(contrib)) +
facet_wrap(~ panel) +
ylab("contrib")

If you pass in the presumarized data, it seems to work:
ggplot(mydata, aes(x=sensitivity, weight=contrib)) +
geom_bar(binwidth=1) +
geom_vline(data = ddply(mydata, "panel", summarize, wavg = sum(contrib)), aes(xintercept=wavg)) +
facet_wrap(~ panel) +
ylab("contrib") +
theme_bw()

Example using dplyr and facet_wrap in case anyone wants it.
library(dplyr)
library(ggplot2)
df1 <- mutate(iris, Big.Petal = Petal.Length > 4)
df2 <- df1 %>%
group_by(Species, Big.Petal) %>%
summarise(Mean.SL = mean(Sepal.Length))
ggplot() +
geom_histogram(data = df1, aes(x = Sepal.Length, y = ..density..)) +
geom_vline(data = df2, mapping = aes(xintercept = Mean.SL)) +
facet_wrap(Species ~ Big.Petal)

vlines <- ddply(mydata, .(panel), summarize, sumc = sum(contrib))
ggplot(merge(mydata, vlines), aes(sensitivity, weight = contrib)) +
geom_bar(binwidth = 1) + geom_vline(aes(xintercept = sumc)) +
facet_wrap(~panel) + ylab("contrib")
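For readers who prefer dplyr to plyr, the same idea applied to the question's mydata might look like the sketch below: summarise one value per panel into its own data frame, then hand that data frame to geom_vline() so each facet gets its own line. The name wavg_df is hypothetical. I have also swapped geom_bar(binwidth = 1) for geom_histogram(binwidth = 1), since as far as I know recent ggplot2 releases expect binned bars to be drawn that way.

```r
library(dplyr)
library(ggplot2)

# Data from the question
mydata <- data.frame(
  panel = c(rep("A", 6), rep("B", 10)),
  instrument = c("V1","V2","V1","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1"),
  cost = c(1,4,1.5,1,4,4,1,2,1.5,1,2,1.5,2,1.5,1,2),
  sensitivity = c(3,5,2,5,5,1,1,2,3,4,3,2,1,3,1,2)
)

# Contribution of each instrument to its panel's cost-weighted sensitivity
mydata <- mydata %>%
  group_by(panel) %>%
  mutate(contrib = cost * sensitivity / sum(cost)) %>%
  ungroup()

# One weighted average per panel; the panel column ties each line to its facet
wavg_df <- mydata %>%
  group_by(panel) %>%
  summarise(wavg = weighted.mean(sensitivity, cost))

p <- ggplot(mydata, aes(x = sensitivity, weight = contrib)) +
  geom_histogram(binwidth = 1) +
  geom_vline(data = wavg_df, aes(xintercept = wavg)) +
  facet_wrap(~ panel) +
  ylab("contrib")
p
```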
