I have point cloud data of an area (x, y, z coordinates).
The plot of X versus Y looks like this:
I am trying to get polygons for the different clusters in these data. I tried the following:
library(sf)
library(concaveman)
points <- df[, 1:2]  # x and y coordinates
pts <- st_as_sf(points, coords = c('X', 'Y'))
conc <- concaveman(pts, concavity = 0.5, length_threshold = 0)
It seems I just get a single polygon bounding the whole data set: conc$polygons is a list with a single element.
How can I get multiple polygons? What am I missing when using concaveman, and what can it provide?
It's hard to tell from your example what variable defines your clusters. Below is an example with some simulated clusters using ggplot2 and data.table (adapted from here).
library(data.table)
library(ggplot2)
# Simulate data: 5 cluster centroids, then 50 points around each
set.seed(1)
n_cluster = 50
centroids = data.frame(
  x = rnorm(5, mean = 0, sd = 5),
  y = rnorm(5, mean = 0, sd = 5)
)
dt = rbindlist(
  lapply(
    seq_len(nrow(centroids)),
    function(i) {
      data.table(
        x = rnorm(n_cluster, mean = centroids$x[i]),
        y = rnorm(n_cluster, mean = centroids$y[i]),
        cluster = i
      )
    }
  )
)
dt[, cluster := as.factor(cluster)]
# Find the convex hull of each cluster:
hulls = dt[, .SD[chull(x, y)], by = .(cluster)]
# Plot (alpha sits outside aes() as a fixed value, so no alpha legend is created):
p = ggplot(data = dt, aes(x = x, y = y, colour = cluster)) +
  geom_point() +
  geom_polygon(data = hulls, aes(fill = cluster), alpha = 0.5)
This produces the following output:
Edit
If you don't have predefined clusters, you can use a clustering algorithm. As a simple example, see below for a solution using kmeans with 5 centroids.
# Estimate clusters (e.g. kmeans):
dt[, km_cluster := as.factor(kmeans(.SD, 5)$cluster), .SDcols = c("x", "y")]
# Find the convex hull of each estimated cluster:
hulls = dt[, .SD[chull(x, y)], by = .(km_cluster)]
# Plot:
p = ggplot(data = dt, aes(x = x, y = y, colour = km_cluster)) +
  geom_point() +
  geom_polygon(data = hulls, aes(fill = km_cluster), alpha = 0.5)
In this case the output for the estimated clusters is almost identical to the constructed ones.
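To tie this back to concaveman: it returns one concave hull for whatever set of points it is given, which is why you saw a single polygon. Applying it per cluster yields one polygon per cluster. A minimal sketch, reusing dt and km_cluster from above (the concavity value is an arbitrary choice here):
library(sf)
library(concaveman)
# One concave hull per cluster: split the points by cluster id and
# hull each subset separately.
pts <- st_as_sf(as.data.frame(dt), coords = c("x", "y"))
hulls_sf <- do.call(rbind,
                    lapply(split(pts, pts$km_cluster),
                           concaveman, concavity = 2))
plot(st_geometry(hulls_sf))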
I would like to reproduce the plot of the spatial dependency of regions in ggplot2 rather than using base plot in R.
I provide a reproducible example in the code below.
I followed this example: Plotting neighborhoods network to a ggplot maps
library(leaflet)
library(ggplot2)
library(sf)
library(spdep)
URL <- "https://biogeo.ucdavis.edu/data/gadm3.6/Rsp/gadm36_CZE_1_sp.rds"
data <- readRDS(url(URL))
ggplot() +
  geom_polygon(data = data, aes(x = long, y = lat, group = group), color = "black", fill = NA)
cns <- knearneigh(coordinates(data), k = 3, longlat=T)
scnsn <- knn2nb(cns, row.names = NULL, sym = T)
cns
scnsn
cS <- nb2listw(scnsn)
summary(cS)
# Plot of the regions and the k-nn neighbourhood links
plot(data)
plot(cS, coordinates(data), add = T)
I am asking how to reproduce the plot of regions and the k-nn neighbourhood matrix using ggplot.
I know we have to retrieve each pair of point coordinates and then use geom_segment; however, I don't know how to retrieve them from the cS object.
The other SO post you are referring to contains all the steps you need to follow to get your plot (thanks to the great answer from @StupidWolf).
Basically, you need to extract the different segments as follows:
1) Transform the coordinates of data into a dataframe, which will make them easier to use later:
data_df <- data.frame(coordinates(data))
colnames(data_df) <- c("long", "lat")
This data_df now contains all the x, y values for plotting points.
2) Now we can retrieve the segment information from the cS object:
n = length(attributes(cS$neighbours)$region.id)
DA = data.frame(
  from = rep(1:n, sapply(cS$neighbours, length)),
  to = unlist(cS$neighbours),
  weight = unlist(cS$weights)
)
DA = cbind(DA, data_df[DA$from, ], data_df[DA$to, ])
colnames(DA)[4:7] = c("long", "lat", "long_to", "lat_to")
The DA dataframe now holds all the information required to draw each segment.
3) Finally, you can plot all the parts together:
ggplot(data, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group), color = "black", fill = NA) +
  geom_point(data = data_df, aes(x = long, y = lat), size = 1) +
  geom_segment(data = DA, aes(xend = long_to, yend = lat_to), size = 0.5)
Again, the solution provided by @StupidWolf was well written and understandable, so I am not sure why you were not able to reproduce it.
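As an aside, spdep can also build the link geometry for you via nb2lines(), which avoids the manual bookkeeping. A rough sketch under the same objects as above (do check the argument names against your spdep version):
library(sf)
# Let spdep turn the neighbour list into line geometry, then plot
# polygons and links as sf layers.
nb_lines <- spdep::nb2lines(scnsn, wts = cS$weights, coords = coordinates(data))
ggplot() +
  geom_sf(data = st_as_sf(data), fill = NA, colour = "black") +
  geom_sf(data = st_as_sf(nb_lines))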
Problem:
1.) I have a shapefile that looks like this:
Extreme values for coordinates are: xmin = 300,000, xmax = 620,000, ymin = 31,000 and ymax = 190,000.
2.) I have a dataset of approx. 2 million points (every point is inside the given polygon), each belonging to one of 5 different categories.
Now, for every point inside the border (the spacing between points has to be 10, which would give us 580,800,000 points) I want to determine a color, depending on the category of the nearest point in the dataset.
In the end I would like to draw a ggplot where the color of every point depends on its category (so I'll use 5 different colors).
What I have so far:
My attempted solutions are not optimized, and it takes R forever to determine the categories for every point inside the polygon.
1.) I created a new dataset of points forming a rectangle spanning the extreme coordinate values, with 10 units between points. From this new dataset I selected the points that fall inside the border of the polygons (with the function pnt.in.poly from package SDMTools). Then I wanted to find, for every point in the polygon, the nearest point in the dataset and take over its category, but I never managed to work through a subset of 580,800,000 points (obviously).
2.) I tried to take the 2 million points and color an area around each one depending on its category, but that did not work right.
I know that it is not possible to plot so many points, or to see the difference between a plot with 200,000,000 points and one with 1,000,000 points, but I would like the coloring to stay accurate when zooming in on (drawing) just one little spot of the polygon (of size 100 x 100, for example).
Question: Is there a better way of coloring so many points in a polygon (by creating a new shapefile or by grouping points)?
Thank you for your ideas!
It's really helpful if you include some data with your question, even (especially) if it's a toy data set. As you don't, I've made a toy example. First, I define a simple shape data frame and a data frame of synthetic data that includes x, y, and grp (i.e., a categorical variable with 5 levels). I crop the latter to the former and plot the results.
# Dummy shape (polygon outline)
df_shape <- data.frame(x = c(0, 0.5, 1, 0.5, 0),
                       y = c(0, 0.2, 1, 0.8, 0))
# Load libraries
library(ggplot2)
library(sgeostat) # for the in.polygon function
# Data frame of synthetic data: random [x, y] and category (grp)
df_synth <- data.frame(x = runif(500),
                       y = runif(500),
                       grp = factor(sample(1:5, 500, replace = TRUE)))
# Remove points outside the polygon
df_synth <- df_synth[in.polygon(df_synth$x, df_synth$y, df_shape$x, df_shape$y), ]
# Plot shape and synthetic data
g <- ggplot(df_shape, aes(x = x, y = y)) + geom_path(colour = "#FF3300", size = 1.5)
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_synth, aes(x = x, y = y, colour = grp))
g
Next, I create a regular grid and crop that using the polygon.
# Create a grid
df_grid <- expand.grid(x = seq(0, 1, length.out = 50),
                       y = seq(0, 1, length.out = 50))
# Keep only the grid points inside the polygon
df_grid <- df_grid[in.polygon(df_grid$x, df_grid$y, df_shape$x, df_shape$y), ]
# Plot shape and show that the remaining points are inside
g <- ggplot(df_shape, aes(x = x, y = y)) + geom_path(colour = "#FF3300", size = 1.5)
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_grid, aes(x = x, y = y))
g
To classify each point on this grid by the nearest point in the synthetic data set, I use knn, i.e., k-nearest neighbours with k = 1. That gives something like this.
# Classify grid points according to synthetic data set using k-nearest neighbour
df_grid$grp <- class::knn(df_synth[, 1:2], df_grid, df_synth[, 3])
# Show categorised points
g <- ggplot()
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_grid, aes(x = x, y = y, colour = grp))
g
So, that's how I'd address that part of your question about classifying points on a grid.
The other part of your question seems to be about resolution. If I understand correctly, you want the same resolution even when you're zoomed in. Also, you don't want to plot so many points when zoomed out, as you can't even see them. Here, I create a plotting function that lets you specify the resolution. First, I plot all the points in the shape with 50 points in each direction. Then, I plot a subregion (i.e., zoom) but keep the number of points in each direction the same, so that it looks much like the previous plot in terms of the number of dots.
res_plot <- function(xlim, xn, ylim, yn, df_data, df_sh){
  # Create a grid
  df_gr <- expand.grid(x = seq(xlim[1], xlim[2], length.out = xn),
                       y = seq(ylim[1], ylim[2], length.out = yn))
  # Keep only the grid points inside the polygon
  df_gr <- df_gr[in.polygon(df_gr$x, df_gr$y, df_sh$x, df_sh$y), ]
  # Classify grid points against the data set using k-nearest neighbour
  df_gr$grp <- class::knn(df_data[, 1:2], df_gr, df_data[, 3])
  g <- ggplot()
  g <- g + ggthemes::theme_clean()
  g <- g + geom_point(data = df_gr, aes(x = x, y = y, colour = grp))
  g <- g + xlim(xlim) + ylim(ylim)
  g
}
# Example plot
res_plot(c(0, 1), 50, c(0, 1), 50, df_synth, df_shape)
# Same resolution, but different limits
res_plot(c(0.25, 0.75), 50, c(0, 1), 50, df_synth, df_shape)
Created on 2019-05-31 by the reprex package (v0.3.0)
Hopefully, that addresses your question.
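One scaling note: class::knn computes all pairwise distances, which will hurt at your real data size (~2 million reference points). A kd-tree based lookup such as FNN::get.knnx() should scale far better. A sketch, assuming the FNN package and the df_synth/df_grid objects from above:
library(FNN)
# k = 1 nearest-neighbour lookup via a kd-tree (fast at scale)
idx <- get.knnx(data = as.matrix(df_synth[, c("x", "y")]),
                query = as.matrix(df_grid[, c("x", "y")]),
                k = 1)$nn.index[, 1]
df_grid$grp <- df_synth$grp[idx]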
This should be fairly easy, but I can't find my way through.
tri_fill <- data.frame(x = c(0.75, 0.75, 2.25, 3.25),
                       y = c(40, 43, 43, 40))
# install.packages("ggplot2", dependencies = TRUE)
require(ggplot2)
ggplot(data = tri_fill, aes(x = x, y = y)) +
  geom_polygon() +
  scale_fill_gradient(limits = c(1, 4), low = "lightgrey", high = "red")
What I want is a gradient along the x-axis, but with the above I only get a legend with a gradient and the polygon with solid fill.
Here is a possible solution for when you have a relatively simple polygon. Instead of drawing one polygon, we draw many thin line segments and color them along a gradient; the result looks like a polygon filled with a gradient.
# Create data for n segments
n_segs <- 1000
# x and xend are sequences spanning the entire range of 'x' in the data
newpolydata <- data.frame(xstart = seq(min(tri_fill$x), max(tri_fill$x), length.out = n_segs))
newpolydata$xend <- newpolydata$xstart
# The y's are a little more complicated: when x is below the changepoint,
# y equals max(y), but when x is above the changepoint the border of the
# polygon follows a line with formula y = intercept + x * slope.
# Identify the changepoint (very data/shape dependent)
change_point <- max(tri_fill$x[which(tri_fill$y == max(tri_fill$y))])
# Calculate slope and intercept
slope <- (max(tri_fill$y) - min(tri_fill$y)) / (change_point - max(tri_fill$x))
intercept <- max(tri_fill$y)
# All lines start at the same y
newpolydata$ystart <- min(tri_fill$y)
# Calculate the y-end of each segment
newpolydata$yend <- with(newpolydata,
                         ifelse(xstart <= change_point,
                                max(tri_fill$y),
                                intercept + (xstart - change_point) * slope))
p2 <- ggplot(newpolydata) +
  geom_segment(aes(x = xstart, xend = xend, y = ystart, yend = yend, color = xstart)) +
  scale_color_gradient(limits = c(0.75, 4), low = "lightgrey", high = "red")
p2 # note that I've changed the lower limit of the gradient
EDIT: The above works if you only want a polygon with a gradient. However, as was pointed out in the comments, it causes problems if you plan to map one variable to fill and another to color, since each aesthetic can only be used once. I have therefore modified the solution to plot not lines but (very thin) polygons, which can take a fill aesthetic.
# For each id/polygon we need four x values and four y values: we start at
# the lower-left corner and go to the upper-left, upper-right, and lower-right.
n_polys <- 1000
# Identify the changepoint (very data/shape dependent)
change_point <- max(tri_fill$x[which(tri_fill$y == max(tri_fill$y))])
# Calculate slope and intercept
slope <- (max(tri_fill$y) - min(tri_fill$y)) / (change_point - max(tri_fill$x))
intercept <- max(tri_fill$y)
# Calculate the sequence of borders: x, with accompanying lower and upper y coordinates
x_seq <- seq(min(tri_fill$x), max(tri_fill$x), length.out = n_polys + 1)
y_max_seq <- ifelse(x_seq <= change_point, max(tri_fill$y), intercept + (x_seq - change_point) * slope)
y_min_seq <- rep(min(tri_fill$y), n_polys + 1)
# Create the polygons/rectangles
poly_list <- lapply(1:n_polys, function(p) {
  res <- data.frame(x = rep(c(x_seq[p], x_seq[p + 1]), each = 2),
                    y = c(y_min_seq[p], y_max_seq[p:(p + 1)], y_min_seq[p + 1]))
  res$fill_id <- x_seq[p]
  res
})
poly_data <- do.call(rbind, poly_list)
# Plot, allowing for both fill and colour aesthetics
p3 <- ggplot(tri_fill, aes(x = x, y = y)) +
  geom_polygon(data = poly_data, aes(x = x, y = y, group = fill_id, fill = fill_id)) +
  scale_fill_gradient(limits = c(0.75, 4), low = "lightgrey", high = "red") +
  geom_point(aes(color = factor(y)), size = 5)
p3
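For arbitrary (including non-convex) outlines the same idea generalizes: lay a fine grid of tiles over the bounding box, keep only the tiles inside the polygon, and map the gradient variable to fill. A sketch, with sp::point.in.polygon assumed for the inside test:
library(sp)
# Gradient fill via a fine grid of tiles clipped to the polygon
grid <- expand.grid(x = seq(min(tri_fill$x), max(tri_fill$x), length.out = 400),
                    y = seq(min(tri_fill$y), max(tri_fill$y), length.out = 300))
inside <- sp::point.in.polygon(grid$x, grid$y, tri_fill$x, tri_fill$y) > 0
ggplot(grid[inside, ], aes(x = x, y = y, fill = x)) +
  geom_tile() +
  scale_fill_gradient(low = "lightgrey", high = "red")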
My initial goal was to plot a population of individual points and then draw a convex hull enclosing 80% of that population centered on the mass of the population.
After trying a number of ideas, the best solution I came up with was to use ggplot's stat_density2d. While this works great for a qualitative analysis, I still need to indicate an 80% boundary. I started out looking for a way to outline the 80th percentile population boundary, but I can work with an 80% probability density boundary instead.
Here's where I'm looking for help. The bins parameter for kde2d (as used by stat_density2d) is not clearly documented. If I set bins = 4 in the example below, am I correct in interpreting the central (green) region as containing a 25% probability mass and the combined yellow, red, and green areas as representing a 75% probability mass? If so, would changing bins to 5 make the inscribed area equal an 80% probability mass?
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
TestData <- ggplot(data = df) +
  stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
                 bins = 4, geom = "polygon") +
  geom_point(aes(x = x, y = y)) +
  scale_fill_manual(values = c("yellow", "red", "green", "royalblue", "black"))
TestData
I repeated a number of test cases and manually counted the excluded points [I would love to find a way to count them based on which ..level.. contour they fall within], but given the random nature of the data (both my real data and the test data), the number of points outside the stat_density2d area varied enough to warrant asking for help.
Summarizing: is there a practical means of drawing a polygon around the central 80% of a population of points in a data frame? Or, barring that, am I safe to use stat_density2d with bins set to 5 to produce an 80% probability mass?
Excellent answer from Bryan Hanson, dispelling my fuzzy notion that I could pass an undocumented bins parameter through stat_density2d. The results looked close for bins values around 4 to 6 but, as he stated, the actual behaviour is unknown and therefore not usable.
I used HPDregionplot, as provided in the accepted answer by DWin, to solve my problem. To that I added a center of gravity (COGravity) and point-in-polygon (pnt.in.poly) from the SDMTools package to complete the analysis.
library(MASS)
library(coda)
library(SDMTools)
library(emdbook)
library(ggplot2)
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
HPDregionplot(mcmc(data.matrix(df)), prob=0.8)
with(df, points(x,y))
ContourLines <- as.data.frame(HPDregionplot(mcmc(data.matrix(df)), prob=0.8))
df$inpoly <- pnt.in.poly(df, ContourLines[, c("x", "y")])$pip
dp <- df[df$inpoly == 1,]
COG100 <- as.data.frame(t(COGravity(df$x, df$y)))
COG80 <- as.data.frame(t(COGravity(dp$x, dp$y)))
TestData <- ggplot(data = df) +
  stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
                 bins = 5, geom = "polygon") +
geom_point(aes(x = x, y = y, colour = as.factor(inpoly)), alpha = 1) +
geom_point(data=COG100, aes(COGx, COGy),colour="white",size=2, shape = 4) +
geom_point(data=COG80, aes(COGx, COGy),colour="green",size=4, shape = 3) +
geom_polygon(data = ContourLines, aes(x = x, y = y), color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
TestData
nrow(dp)/nrow(df) # actual fraction of the population inside the 80% probability polygon
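On the bracketed wish above (counting points by the ..level.. contour they fall within): one rough option is to pull the contour polygons back out of the built plot. A sketch, assuming each level draws a single ring (levels split into several pieces would also need grouping by the piece column):
# Count data points inside each plotted density contour
bld <- ggplot_build(TestData)$data[[1]]
sapply(split(bld, bld$level),
       function(ring) sum(pnt.in.poly(df[, c("x", "y")], ring[, c("x", "y")])$pip))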
Alright, let me start by saying I'm not entirely sure of this answer, and it's only a partial answer! There is no bins parameter for MASS::kde2d, which is the function used by stat_density2d. Looking at the help page for kde2d and its code (seen simply by typing the function name in the console), I think the bins parameter corresponds to h (though how these functions would pass bins on to h is not clear). Following the help page, we see that if h is not provided, it is computed by MASS::bandwidth.nrd. The help page for that function says this:
# The function is currently defined as
function(x)
{
    r <- quantile(x, c(0.25, 0.75))
    h <- (r[2] - r[1]) / 1.34
    4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)
}
Based on this, I think the answer to your last question ("Am I safe...") is definitely no. The r in the function above is what your assumption would need to hold, but it is clearly modified, so you are not safe. HTH.
Additional thought: do you have any evidence that your code is actually using your bins argument? I wonder whether it is being ignored. If so, try passing h in place of bins and see if it listens.
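For instance, a quick test along these lines (a sketch; note that h is kde2d's bandwidth, one value per axis, not a bin count):
# If the contours move when h changes, the parameter is being used
ggplot(df, aes(x = x, y = y)) +
  stat_density2d(aes(fill = as.factor(..level..)), geom = "polygon", h = c(1, 1)) +
  geom_point()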
HPDregionplot in package emdbook is supposed to do that. It does use MASS::kde2d, but it normalizes the result. Its disadvantage, to my mind, is that it requires an mcmc object.
library(MASS)
library(coda)
library(emdbook)
HPDregionplot(mcmc(data.matrix(df)), prob = 0.8)
with(df, points(x, y))
Building on the answer by 42, I've simplified HPDregionplot() to reduce dependencies and remove the requirement to work with mcmc objects. The function works on a two-column data.frame and creates no intermediate plots. Note, however, that this approach breaks as soon as grDevices::contourLines() returns multiple contours.
hpd_contour <- function(x, n = 50, prob = 0.95, ...) {
  post1 <- MASS::kde2d(x[[1]], x[[2]], n = n, ...)
  dx <- diff(post1$x[1:2])
  dy <- diff(post1$y[1:2])
  sz <- sort(post1$z)
  c1 <- cumsum(sz) * dx * dy
  levels <- sapply(prob, function(p) {
    approx(c1, sz, xout = 1 - p)$y
  })
  as.data.frame(grDevices::contourLines(post1$x, post1$y, post1$z, levels = levels))
}
library(ggplot2)
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
ContourLines <- hpd_contour(df, prob=0.8)
ggplot(df, aes(x = x, y = y)) +
stat_density2d(aes(fill = as.factor(..level..)), bins=5, geom = "polygon") +
geom_point() +
geom_polygon(data = ContourLines, color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
Moreover, the workflow easily extends to grouped data (dplyr provides the grouping here):
library(dplyr)
ContourLines <- iris[, c("Species", "Sepal.Length", "Sepal.Width")] %>%
group_by(Species) %>%
do(hpd_contour(.[, c("Sepal.Length", "Sepal.Width")], prob=0.8))
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.6) +
geom_polygon(data = ContourLines, fill = NA) +
guides(color = "none") +
theme(plot.margin = margin())
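If you do hit the multiple-contour case mentioned above (e.g., multimodal data), a small variant that tags each contour piece keeps the result plottable. A sketch under the same assumptions as hpd_contour():
# Like hpd_contour(), but keeps an id per contour piece so that several
# separate rings per level survive as separate polygons.
hpd_contour2 <- function(x, n = 50, prob = 0.95, ...) {
  post1 <- MASS::kde2d(x[[1]], x[[2]], n = n, ...)
  dx <- diff(post1$x[1:2])
  dy <- diff(post1$y[1:2])
  sz <- sort(post1$z)
  c1 <- cumsum(sz) * dx * dy
  levels <- sapply(prob, function(p) approx(c1, sz, xout = 1 - p)$y)
  cl <- grDevices::contourLines(post1$x, post1$y, post1$z, levels = levels)
  do.call(rbind, lapply(seq_along(cl), function(i) {
    data.frame(piece = i, level = cl[[i]]$level, x = cl[[i]]$x, y = cl[[i]]$y)
  }))
}
# Then map group = piece in geom_polygon() to keep the rings apart.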