I have two spatial datasets with coordinates indicating observations of a species and want to estimate the area of overlap among these datasets. Since point coordinates cannot represent an area, one has to bin the coordinates using similar x (longitude) and y (latitude) categories for both datasets.
For this task, I found the handy hexbin package, which does hexagonal binning. The package is great, but I cannot find a function that directly outputs the coordinates or IDs of the bins that overlap between two hexbin objects. For example, hdiffplot returns a nice graphical overview of overlapping bins, but how can this information be extracted for further analysis?
library(hexbin)
set.seed(1); df1 <- data.frame(x = rnorm(10, 0, 5), y = rnorm(10, 0, 5))
set.seed(2); df2 <- data.frame(x = rnorm(10, 0, 5), y = rnorm(10, 0, 5))
xrange <- c(floor(min(c(df1$x, df2$x))-1), ceiling(max(c(df1$x, df2$x))+1))
#-/+1 just to make the plot nicer
yrange <- c(floor(min(c(df1$y, df2$y))-1), ceiling(max(c(df1$y, df2$y)))+1)
hb1 <- hexbin(df1$x, df1$y, xbins = 10, xbnds = xrange, ybnds = yrange)
hb2 <- hexbin(df2$x, df2$y, xbins = 10, xbnds = xrange, ybnds = yrange)
hdiffplot(hb1,hb2, xbnds = xrange, ybnds = yrange)
I figured out a solution to this problem while making the question. Will post it here in hopes that it will help someone one day.
You can extract the coordinates using the hcell2xy function. Here is a little function to find the unique and overlapping coordinates for bin centroids:
#' @title Print overlapping and unique bin centroid coordinates for two hexbin objects
#' @param bin1,bin2 two objects of class hexbin.
#' @details The hexbin objects for comparison, bin1 and bin2, must have the same plotting limits and cell size.
#' @return Returns a list of data frames with unique coordinates for \code{bin1} and \code{bin2} as well as overlapping coordinates among bins.
hdiffcoords <- function(bin1, bin2) {
  ## Checks modified from: https://github.com/edzer/hexbin/blob/master/R/hdiffplot.R
  if (is.null(bin1) || is.null(bin2)) {
    stop("Need 2 hex bin objects")
  }
  if (bin1@shape != bin2@shape)
    stop("Bin objects must have same shape parameter")
  if (!(all(bin1@xbnds == bin2@xbnds) && all(bin1@ybnds == bin2@ybnds)))
    stop("Bin objects need the same xbnds and ybnds")
  if (bin1@xbins != bin2@xbins)
    stop("Bin objects need the same number of bins")
  ## Find overlapping and unique bins
  hd1 <- data.frame(hcell2xy(bin1), count_bin1 = bin1@count, cell_bin1 = bin1@cell)
  hd2 <- data.frame(hcell2xy(bin2), count_bin2 = bin2@count, cell_bin2 = bin2@cell)
  ## For each bin centroid, count the centroids of the other object at the same x and y
  overlapping_hd1 <- apply(hd1, 1, function(r, A){ sum(A$x==r[1] & A$y==r[2]) }, hd2)
  overlapping_hd2 <- apply(hd2, 1, function(r, A){ sum(A$x==r[1] & A$y==r[2]) }, hd1)
  overlaps <- merge(hd1[as.logical(overlapping_hd1),], hd2[as.logical(overlapping_hd2),])
  unique_hd1 <- hd1[!as.logical(overlapping_hd1),]
  unique_hd2 <- hd2[!as.logical(overlapping_hd2),]
  ## Return list of data.frames
  list(unique_bin1 = unique_hd1, unique_bin2 = unique_hd2, overlapping = overlaps)
}
This information should be the same as what hdiffplot shows in graphical format:
df <- hdiffcoords(hb1, hb2)
library(ggplot2)
ggplot() +
geom_point(data = df$unique_bin1, aes(x = x, y = y), color = "red", size = 10) +
geom_point(data = df$unique_bin2, aes(x = x, y = y), color = "cyan", size = 10) +
geom_point(data = df$overlapping, aes(x = x, y = y), color = "green", size = 10) + theme_bw()
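A possible shortcut, offered as a sketch and not as part of the original solution: when both hexbin objects are built with the same xbnds, ybnds, xbins and shape, their cell slots refer to the same hexagon lattice, so overlapping bins can be found by comparing cell IDs directly. The names shared, only1, only2 and shared_xy are my own.

```r
library(hexbin)

set.seed(1); df1 <- data.frame(x = rnorm(10, 0, 5), y = rnorm(10, 0, 5))
set.seed(2); df2 <- data.frame(x = rnorm(10, 0, 5), y = rnorm(10, 0, 5))

# Common bounds so that both objects share one hexagon lattice
xrange <- range(c(df1$x, df2$x)) + c(-1, 1)
yrange <- range(c(df1$y, df2$y)) + c(-1, 1)
hb1 <- hexbin(df1$x, df1$y, xbins = 10, xbnds = xrange, ybnds = yrange)
hb2 <- hexbin(df2$x, df2$y, xbins = 10, xbnds = xrange, ybnds = yrange)

# Cell IDs occupied in both objects, or in only one of them
shared <- intersect(hb1@cell, hb2@cell)
only1  <- setdiff(hb1@cell, hb2@cell)
only2  <- setdiff(hb2@cell, hb1@cell)

# Centroid coordinates of the shared cells
xy1 <- hcell2xy(hb1)
in_shared <- hb1@cell %in% shared
shared_xy <- data.frame(cell = hb1@cell[in_shared],
                        x = xy1$x[in_shared],
                        y = xy1$y[in_shared])
```

Comparing integer cell IDs also avoids testing floating-point coordinates for equality.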
Any comments/corrections are appreciated.
Here is a reproducible example to work with:
library(lattice)
myimage<-matrix(c(1,1,2,3,3,4), nrow=3, ncol=2)
mytable<-data.frame(Xcoord=c(1.5, 1.5, 3,3), Ycoord=c(1,2,1,2), Labels=c("A","B","C","D"))
mycolors<-colorRampPalette(c("red","yellow","green","cyan","blue"))
windows()
levelplot(myimage, aspect="iso", col.regions = mycolors)
which produces the graph below.
Now I want to add (as text) the Labels in mytable at the coordinates given by Xcoord and Ycoord (which correspond to the rows and columns of the image). How can I do this?
The only solution I could find was to follow user20650's link above and convert the matrix image to a data.frame with x, y coordinates (which I would have preferred to avoid):
library(latticeExtra)  # provides the `+` method used to overlay trellis objects
dat <- data.frame(expand.grid(x = 1:3, y = 1:2), value = c(myimage))
Obj <-
  levelplot(value ~ x+y, data = dat, aspect="iso", col.regions = mycolors) +
  xyplot(y ~ x, data = dat,
         panel = function(y, x, ...) {
           # Draw the labels at the coordinates stored in mytable
           ltext(x = mytable$Xcoord, y = mytable$Ycoord, labels = mytable$Labels, cex = 1, font = 2)
         })
print(Obj)
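If converting the matrix is undesirable, a custom panel function may be enough. The sketch below assumes that levelplot on a matrix uses row and column indices as x and y (its default), and it passes mycolors(100) as a colour vector, which is my own choice. The object name labelled is also mine.

```r
library(lattice)

myimage  <- matrix(c(1, 1, 2, 3, 3, 4), nrow = 3, ncol = 2)
mytable  <- data.frame(Xcoord = c(1.5, 1.5, 3, 3),
                       Ycoord = c(1, 2, 1, 2),
                       Labels = c("A", "B", "C", "D"))
mycolors <- colorRampPalette(c("red", "yellow", "green", "cyan", "blue"))

labelled <- levelplot(myimage, aspect = "iso", col.regions = mycolors(100),
  panel = function(...) {
    panel.levelplot(...)                       # draw the image as usual
    panel.text(x = mytable$Xcoord,             # then add the labels on top
               y = mytable$Ycoord,
               labels = mytable$Labels, cex = 1, font = 2)
  })
print(labelled)
```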
Problem:
1.) I have a shapefile that looks like this:
Extreme values for coordinates are: xmin = 300,000, xmax = 620,000, ymin = 31,000 and ymax = 190,000.
2.) I have a dataset of approx. 2 million points (every point is inside the given polygon); each one belongs to one of 5 different categories.
Now, for every grid point inside the border (the distance between neighbouring points has to be 10, so that would give us 580,800,000 points) I want to determine a color, depending on the category of the nearest point in the dataset.
In the end I would like to draw a ggplot, where the color of every point is dependent on its category (so I'll use 5 different colors).
What I have so far:
My ideas for a solution are not optimized, and it takes R forever to determine the categories for every point inside the polygon.
1.) I created a new dataset of points forming a rectangle that spans the extreme coordinate values, with 10 units between points. From this new dataset I selected the points that fall inside the polygon border (with the function pnt.in.poly from the package SDMTools). Then I wanted to find, for every point in the polygon, the nearest point in the dataset and take its category, but I never managed to get a subset of the 580,800,000 points (obviously).
2.) I tried to take the 2 million points and color an area around each of them, depending on its category, but that did not work properly.
I know that it is not possible to plot so many points, or to see the difference between a plot with 200,000,000 points and a plot with 1,000,000 points, but I would like to have accurate coloring when zooming in on (drawing) only one small spot in the polygon (of size 100 x 100, for example).
Question: Is there a better way of coloring so many points in a polygon (by creating a new shapefile or grouping points)?
Thank you for your ideas!
It’s really helpful if you include some data with your question, even (especially) if it’s a toy data set. As you didn’t, I’ve made a toy example. First, I define a simple shape data frame and a data frame of synthetic data that includes x, y, and grp (i.e., a categorical variable with 5 levels). I crop the latter to the former and plot the results:
# Dummy shape function
df_shape <- data.frame(x = c(0, 0.5, 1, 0.5, 0),
y = c(0, 0.2, 1, 0.8, 0))
# Load library
library(ggplot2)
library(sgeostat) # For in.polygon function
# Data frame of synthetic data: random [x, y] and category (grp)
df_synth <- data.frame(x = runif(500),
y = runif(500),
grp = factor(sample(1:5, 500, replace = TRUE)))
# Remove points outside polygon
df_synth <- df_synth[in.polygon(df_synth$x, df_synth$y, df_shape$x, df_shape$y), ]
# Plot shape and synthetic data
g <- ggplot(df_shape, aes(x = x, y = y)) + geom_path(colour = "#FF3300", size = 1.5)
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_synth, aes(x = x, y = y, colour = grp))
g
Next, I create a regular grid and crop that using the polygon.
# Create a grid
df_grid <- expand.grid(x = seq(0, 1, length.out = 50),
y = seq(0, 1, length.out = 50))
# Check if grid points are in polygon
df_grid <- df_grid[in.polygon(df_grid$x, df_grid$y, df_shape$x, df_shape$y), ]
# Plot shape and show points are inside
g <- ggplot(df_shape, aes(x = x, y = y)) + geom_path(colour = "#FF3300", size = 1.5)
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_grid, aes(x = x, y = y))
g
To classify each point on this grid by the nearest point in the synthetic data set, I use knn or k-nearest-neighbours with k = 1. That gives something like this.
# Classify grid points according to synthetic data set using k-nearest neighbour
df_grid$grp <- class::knn(df_synth[, 1:2], df_grid, df_synth[, 3])
# Show categorised points
g <- ggplot()
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_grid, aes(x = x, y = y, colour = grp))
g
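To make explicit what knn with k = 1 does here, the sketch below compares it against a brute-force search for the closest training point by Euclidean distance. The data are random and purely illustrative, and nearest_label is a hypothetical helper of mine, not part of the class package.

```r
set.seed(42)
train  <- data.frame(x = runif(50), y = runif(50))
labels <- factor(sample(1:5, 50, replace = TRUE))
query  <- data.frame(x = runif(10), y = runif(10))

# Label of the training point with the smallest squared distance to q
nearest_label <- function(q, train, labels) {
  d2 <- (train$x - q[["x"]])^2 + (train$y - q[["y"]])^2
  as.character(labels[which.min(d2)])
}

brute <- vapply(seq_len(nrow(query)),
                function(i) nearest_label(query[i, ], train, labels),
                character(1))

# Same classification via class::knn
knn1 <- class::knn(train, query, labels, k = 1)

# TRUE when both approaches agree (no distance ties in this random data)
identical(brute, as.character(knn1))
```

The brute-force version scales with the number of training points times the number of grid points, which is one reason to classify only the region being plotted.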
So, that's how I'd address that part of your question about classifying points on a grid.
The other part of your question seems to be about resolution. If I understand correctly, you want the same resolution even if you're zoomed in. Also, you don't want to plot so many points when zoomed out, as you can't even see them. Here, I create a plotting function that lets you specify the resolution. First, I plot all the points in the shape with 50 points in each direction. Then, I plot a subregion (i.e., zoom), but keep the number of points in each direction the same, so that it looks much like the previous plot in terms of the number of dots.
res_plot <- function(xlim, xn, ylim, yn, df_data, df_sh){
# Create a grid
df_gr <- expand.grid(x = seq(xlim[1], xlim[2], length.out = xn),
y = seq(ylim[1], ylim[2], length.out = yn))
# Check if grid points are in polygon
df_gr <- df_gr[in.polygon(df_gr$x, df_gr$y, df_sh$x, df_sh$y), ]
# Classify grid points according to synthetic data set using k-nearest neighbour
df_gr$grp <- class::knn(df_data[, 1:2], df_gr, df_data[, 3])
g <- ggplot()
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_gr, aes(x = x, y = y, colour = grp))
g <- g + xlim(xlim) + ylim(ylim)
g
}
# Example plot
res_plot(c(0, 1), 50, c(0, 1), 50, df_synth, df_shape)
# Same resolution, but different limits
res_plot(c(0.25, 0.75), 50, c(0, 1), 50, df_synth, df_shape)
Created on 2019-05-31 by the reprex package (v0.3.0)
Hopefully, that addresses your question.
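Since the question asks for a fixed spacing of 10 units, not a fixed number of points, the grid for a zoomed window could also be built with by = instead of length.out. This is only a sketch: window_grid is a hypothetical helper, and the window coordinates are made up (chosen to lie within the extents given in the question).

```r
# Regular grid over a window with a fixed spacing between points
window_grid <- function(xlim, ylim, spacing = 10) {
  expand.grid(x = seq(xlim[1], xlim[2], by = spacing),
              y = seq(ylim[1], ylim[2], by = spacing))
}

# A 100 x 100 window at 10-unit spacing
zoom <- window_grid(c(400000, 400100), c(100000, 100100))
nrow(zoom)  # 11 x 11 = 121 grid points
```

Only these points would then need the in-polygon test and the nearest-neighbour classification, so the full grid never has to be held in memory.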
I'd like to generate a choropleth map using the following data points:
Longitude
Latitude
Price
Here is the dataset - https://www.dropbox.com/s/0s05cl34bko7ggm/sample_data.csv?dl=0.
I would like the map to show the areas where the price is higher and where the price is lower. It should most probably look like this (sample image):
Here is my code:
library(ggmap)
map <- get_map(location = "austin", zoom = 9)
data <- read.csv(file.choose(), stringsAsFactors = FALSE)
data$average_rate_per_night <- as.numeric(gsub("[\\$,]", "",
data$average_rate_per_night))
ggmap(map, extent = "device") +
stat_contour( data = data, geom="polygon",
aes( x = longitude, y = latitude, z = average_rate_per_night,
fill = ..level.. ) ) +
scale_fill_continuous( name = "Price", low = "yellow", high = "red" )
I'm getting the following error message:
2: Computation failed in `stat_contour()`:
Contour requires single `z` at each combination of `x` and `y`.
I'd really appreciate any help on how this can be fixed or any other method to generate this type of heatmap. Please note that I'm interested in the weight of the price, not density of the records.
If you insist on using the contour approach, then you need to provide a value for every possible x, y coordinate combination in your data. To achieve this, I would highly recommend gridding the space and generating a summary statistic per bin.
I attach a working example below based on the data you provided:
library(ggmap)
library(data.table)
map <- get_map(location = "austin", zoom = 12)
data <- setDT(read.csv(file.choose(), stringsAsFactors = FALSE))
# convert the rate from string into numbers
data[, average_rate_per_night := as.numeric(gsub(",", "",
substr(average_rate_per_night, 2, nchar(average_rate_per_night))))]
# generate bins for the x, y coordinates
xbreaks <- seq(floor(min(data$latitude)), ceiling(max(data$latitude)), by = 0.01)
ybreaks <- seq(floor(min(data$longitude)), ceiling(max(data$longitude)), by = 0.01)
# allocate the data points into the bins
data$latbin <- xbreaks[cut(data$latitude, breaks = xbreaks, labels=F)]
data$longbin <- ybreaks[cut(data$longitude, breaks = ybreaks, labels=F)]
# Summarise the data for each bin
datamat <- data[, list(average_rate_per_night = mean(average_rate_per_night)),
by = c("latbin", "longbin")]
# Merge the summarised data with all possible x, y coordinate combinations to get
# a value for every bin
datamat <- merge(setDT(expand.grid(latbin = xbreaks, longbin = ybreaks)), datamat,
by = c("latbin", "longbin"), all.x = TRUE, all.y = FALSE)
# Fill the empty bins with 0 so that the contour plot has a value everywhere
datamat[is.na(average_rate_per_night), average_rate_per_night := 0]
# Plot the contours
ggmap(map, extent = "device") +
stat_contour(data = datamat, aes(x = longbin, y = latbin, z = average_rate_per_night,
fill = ..level.., alpha = ..level..), geom = 'polygon', binwidth = 100) +
scale_fill_gradient(name = "Price", low = "green", high = "red") +
guides(alpha = FALSE)
You can then play around with the bin size and the contour binwidth to get the desired result. You could additionally apply a smoothing function to the grid to get an even smoother contour plot.
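One simple smoothing step, sketched here under my own assumptions, is a 3 x 3 moving average over the gridded values. smooth_grid is a hypothetical helper that works on a plain numeric matrix (rows and columns being the bins) and averages each cell with whichever neighbours exist, so edge cells use fewer values.

```r
smooth_grid <- function(m) {
  nr <- nrow(m); nc <- ncol(m)
  # Pad with NA so that every cell has a full 3 x 3 neighbourhood
  padded <- matrix(NA_real_, nr + 2, nc + 2)
  padded[2:(nr + 1), 2:(nc + 1)] <- m
  total <- matrix(0, nr, nc)   # running sum of available neighbours
  n     <- matrix(0, nr, nc)   # number of available neighbours
  for (di in 0:2) for (dj in 0:2) {
    block <- padded[di + seq_len(nr), dj + seq_len(nc), drop = FALSE]
    ok <- !is.na(block)
    total[ok] <- total[ok] + block[ok]
    n <- n + ok
  }
  total / n
}

# Toy grid: a single high-priced bin surrounded by zeros
prices <- matrix(0, 5, 5)
prices[3, 3] <- 90
smoothed <- smooth_grid(prices)
```

The smoothed matrix can be reshaped back to long format (one row per bin) before it is passed to stat_contour.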
You could use the stat_summary_2d() or stat_summary_hex() function to achieve a similar result. These functions divide the data into bins (defined by x and y), and then the z values for each bin are summarised based on a given function. In the example below I have selected mean as an aggregation function and the map basically shows the average price in each bin.
Note: I needed to treat your average_rate_per_night variable appropriately in order to convert it into numbers (removed the $ sign and the comma).
library(ggmap)
library(data.table)
map <- get_map(location = "austin", zoom = 12)
data <- setDT(read.csv(file.choose(), stringsAsFactors = FALSE))
data[, average_rate_per_night := as.numeric(gsub(",", "",
substr(average_rate_per_night, 2, nchar(average_rate_per_night))))]
ggmap(map, extent = "device") +
stat_summary_2d(data = data, aes(x = longitude, y = latitude,
z = average_rate_per_night), fun = mean, alpha = 0.6, bins = 30) +
scale_fill_gradient(name = "Price", low = "green", high = "red")
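For intuition, here is roughly what the binned summary computes, written in base R on synthetic data. The coordinates and prices are random, and the bin edges produced by cut() will not necessarily match those that ggplot2 chooses, so treat it as an illustration of the idea only.

```r
set.seed(1)
pts <- data.frame(lon   = runif(200, -98.0, -97.5),
                  lat   = runif(200, 30.0, 30.5),
                  price = rexp(200, rate = 1 / 150))

n_bins  <- 5
lon_bin <- cut(pts$lon, breaks = n_bins)   # equal-width longitude bins
lat_bin <- cut(pts$lat, breaks = n_bins)   # equal-width latitude bins

# Mean price per (lat, lon) bin; bins without points are NA
bin_mean <- tapply(pts$price, list(lat_bin, lon_bin), mean)

# Number of points per bin, useful for judging how reliable each mean is
bin_n <- table(lat_bin, lon_bin)
```

Swapping mean for median or max in the tapply call corresponds to changing the fun argument of stat_summary_2d().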
Suppose I have the following data frame:
df <- data.frame(A1 = c(0,3.5,0,2.1), A2 =c(0.9,0,0,0.6), A3 = c(0,0.3,0,0.3),A4= c(0,1.9,0,0))
rownames(df) <- names(df)
Each element df(i,j) is the strength of the relation between the ith column and the jth row (the relations are directed, meaning the strength from i to j can differ from the strength from j to i). A "0" entry means there is no relation.
Now I would like to draw a circle, with the variables on the perimeter of the circle, and an arrow that shows which variables are connected to each other, and hopefully show the strength of the connection based on the width of the arrow.
So, the final product I wish to be something like this:
Is it even possible to do something like that with ggplot2?
Thanks in advance.
igraph
We start by making a graph from your adjacency matrix:
library(igraph)
# Transpose so that rows act as origins and columns as destinations
df <- t(df)
ga <- graph.adjacency(as.matrix(df), weighted = TRUE, mode = "directed")
Then, plot a circle:
par(mar = rep(0.25, 4))
pts <- seq(0, 2*pi, l = 100)
plot(cbind(sin(pts), cos(pts)), type = "l", frame = F, xaxt = "n", yaxt = "n")
Finally, plot the graph:
plot.igraph(ga,
vertex.label = V(ga)$name,
edge.width = E(ga)$weight,
edge.curved = TRUE,
edge.label = E(ga)$weight,
layout = layout_in_circle(ga, order = V(ga)),
add = T)
Output below. You can customize your graph (e.g. curvature and colors of edges, shapes of vertices) as desired.
ggplot2
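The main idea is to set up three sets of geoms: the circle, the nodes (vertices), and the lines (edges). These steps start again from the original df data frame, not the transposed matrix used for igraph. First, we load some packages, and prep the circle and nodes: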
library(ggplot2)
library(tidyr)
library(dplyr)
# For circle
pts <- seq(0, 2*pi, l = 100)
# For nodes
theta <- seq(0, 2*pi, l = nrow(df) + 1)[1:nrow(df)]
l <- data.frame(x = sin(theta), y = cos(theta), v = names(df),
stringsAsFactors = FALSE)
The edges are a little bit more involved. I make a function to make coordinates for the lines, given an origin and destination:
make_edge <- function(origin, dest, l, shrink = .9) {
# l is the layout matrix for the nodes that we made previously
data.frame(
x0 = l$x[l$v == origin],
y0 = l$y[l$v == origin],
x1 = l$x[l$v == dest],
y1 = l$y[l$v == dest]
) * shrink
}
Then, we reshape the adjacency matrix into an edge list (one row per non-zero connection), and bind the edge coordinates to it:
gr <- gather(mutate(df, dest = names(df)), origin, wt, -dest)
gr <- gr[gr$wt != 0, ]
edges <- do.call(rbind,
mapply(make_edge, gr$origin, gr$dest, list(l), shrink = .94, SIMPLIFY = F)
)
ga <- cbind(gr, edges)
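The same edge list can be built in base R, which may help in checking the reshaping step. This is a sketch that follows the convention used above (the column name is the origin, the row name is the destination); edge_list is an illustrative name.

```r
df <- data.frame(A1 = c(0, 3.5, 0, 2.1), A2 = c(0.9, 0, 0, 0.6),
                 A3 = c(0, 0.3, 0, 0.3), A4 = c(0, 1.9, 0, 0))
rownames(df) <- names(df)

m   <- as.matrix(df)
idx <- which(m != 0, arr.ind = TRUE)   # row/col positions of non-zero entries

edge_list <- data.frame(origin = colnames(m)[idx[, "col"]],
                        dest   = rownames(m)[idx[, "row"]],
                        wt     = m[idx],
                        stringsAsFactors = FALSE)
nrow(edge_list)  # 7 non-zero connections
```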
Finally, we plot:
ggplot() +
geom_path(data = data.frame(x = sin(pts), y = cos(pts)), aes(x, y)) +
geom_label(data = l, aes(x, y, label = v)) +
geom_curve(data = ga,
aes(x = x0, y = y0, xend = x1, yend = y1, size = wt, colour = origin),
alpha = 0.8,
curvature = 0.1,
arrow = arrow(length = unit(2, "mm"))) +
scale_size_continuous(range=c(.25,2), guide = FALSE) +
theme_void()
Output:
I wrote a little package that does this kind of thing. Here's a small demo vignette https://github.com/mkearney/lavplot/blob/master/vignettes/demo.Rmd. Image of plot provided below.
I want to make a scatter plot with connecting lines for different groups and different individuals. I create panels conditioned on my group variable and groups conditioned on my individual variable. Now I would like to add a legend inside each panel (see the code below). In the plots, I would like to have the legend of the individuals with GRP==1 in the first panel, GRP==2 in the second panel, and so on. Each legend should be located in the upper left corner of the panel it belongs to. How should I code this?
library(lattice)
mydata <- data.frame(ID = rep(1: 20, each = 10),
GRP = rep(1: 4, each = 50),
x = rep(0: 9, 20))
mydata$y <- 1.2 * mydata$GRP * mydata$x +
rnorm(nrow(mydata), sd = mydata$GRP)
xyplot(y~ x | factor(GRP), data = mydata,
groups = ID,
type = "b",
as.table = T,
layout = c(2, 2),
panel = panel.superpose,
panel.groups = function (x, y, ...) {
panel.xyplot(x, y, ...)
}
)
Try something like this. Note that the subset call goes in the data argument of xyplot. This is on purpose: if you passed subset as an xyplot argument instead, the plots would show all 20 labels in each panel.
library(lattice)
mydata <- data.frame(ID = rep(1:20, each = 10), GRP = rep(1:4, each = 50), x = rep(0:9, 20))
mydata$y <- 1.2 * mydata$GRP * mydata$x + rnorm(nrow(mydata), sd = mydata$GRP)
i=1; j=1
for(grp in 1:4) {
a <- xyplot(y~x|factor(GRP), data=subset(mydata, GRP==grp),
groups = factor(ID),
type = "b",
auto.key=list(columns=4,space="inside")
)
print(a, split=c(i,j,2,2), more=T)
i=i+1; if(i>2){i=1;j=j+1} # basically, tell the plots which quadrant to go in
}
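A variation on the loop above, offered as a sketch: the quadrant can be computed from the loop index instead of being tracked with counters, and the key can be pinned to the upper left of each plot with corner = c(0, 1). split_position is a hypothetical helper of mine, and set.seed is added only for reproducibility.

```r
library(lattice)

set.seed(1)
mydata <- data.frame(ID = rep(1:20, each = 10),
                     GRP = rep(1:4, each = 50),
                     x = rep(0:9, 20))
mydata$y <- 1.2 * mydata$GRP * mydata$x + rnorm(nrow(mydata), sd = mydata$GRP)

# Column and row of the k-th plot when filling an n_col-wide layout row by row
split_position <- function(k, n_col) {
  c((k - 1) %% n_col + 1, (k - 1) %/% n_col + 1)
}

n_col <- 2; n_row <- 2
grps <- sort(unique(mydata$GRP))
for (k in seq_along(grps)) {
  p <- xyplot(y ~ x | factor(GRP), data = subset(mydata, GRP == grps[k]),
              groups = factor(ID), type = "b",
              auto.key = list(corner = c(0, 1), points = TRUE, lines = TRUE))
  pos <- split_position(k, n_col)
  # more = FALSE on the last plot finishes the page
  print(p, split = c(pos, n_col, n_row), more = k < length(grps))
}
```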