Mean aggregation in R (Polygon in Polygon) - r

I have a set of polygons that represent the unit of analysis (gadmpolys).
In addition I have a set of polygons with levels of various variables (r3mergepolys).
What I want to accomplish is to aggregate the mean of one or more variables from the polygons (from r3mergepolys) that intersect with the unit of analysis polygons (gadmpolys).
I believe the over and/or aggregate functions are my friends, but I cannot figure out how to write the code.
# gadmpolys is the spdf containing my units of analysis
# r3mergepoly is the spdf with many smaller polygons which I want to aggregate from
r3mergepoly <- SpatialPolygonsDataFrame(Sr=r3polys, data=r3merge, match.ID=TRUE)
# Overlay GADMpolys and Afrobarometer-GADM matched polygons. Aggregate survey results for intersecting polygons
gadmpoly_r3 <- over(gadmpoly, r3mergepoly[17:21], fn=mean)

Quick and ugly centroid-based work-around.
B <- SpatialPointsDataFrame(gCentroid(poly.pr, byid=TRUE), poly.pr@data, match.ID=FALSE)
plot(A)
points(B)
# Overlay points and extract just the code column:
a.data <- over(A, B[,"code"])
# Add that data back to A:
A$bcode <- a.data$code
The sf package's implementation of aggregate also covers this case, and its documentation provides a working example:
m1 = cbind(c(0, 0, 1, 0), c(0, 1, 1, 0))
m2 = cbind(c(0, 1, 1, 0), c(0, 0, 1, 0))
pol = st_sfc(st_polygon(list(m1)), st_polygon(list(m2)))
set.seed(1985)
d = data.frame(matrix(runif(15), ncol = 3))
p = st_as_sf(x = d, coords = 1:2)
plot(pol)
plot(p, add = TRUE)
(p_ag1 = aggregate(p, pol, mean))
plot(p_ag1) # geometry same as pol
# works when x overlaps multiple objects in 'by':
p_buff = st_buffer(p, 0.2)
plot(p_buff, add = TRUE)
(p_ag2 = aggregate(p_buff, pol, mean)) # increased mean of second
# with non-matching features
m3 = cbind(c(0, 0, -0.1, 0), c(0, 0.1, 0.1, 0))
pol = st_sfc(st_polygon(list(m3)), st_polygon(list(m1)), st_polygon(list(m2)))
(p_ag3 = aggregate(p, pol, mean))
plot(p_ag3)
# In case we need to pass an argument to the join function:
(p_ag4 = aggregate(p, pol, mean,
join = function(x, y) st_is_within_distance(x, y, dist = 0.3)))
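For the polygon-in-polygon case from the question, the same aggregate method can be given polygons instead of points. What follows is only a minimal sketch under stated assumptions, not the asker's data: the helper sq(), the objects units_sf and small_sf, and the value column are made up for illustration, and st_intersects is assumed to be the appropriate join.

```r
library(sf)

# Hypothetical helper: axis-aligned square with lower-left corner (x0, y0) and side s
sq <- function(x0, y0, s) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + s, y0), c(x0 + s, y0 + s),
                        c(x0, y0 + s), c(x0, y0))))
}

# Two large units of analysis (stand-ins for the GADM polygons)
units_sf <- st_sf(unit = c("U1", "U2"),
                  geometry = st_sfc(sq(0, 0, 2), sq(2, 0, 2)))

# Four small polygons carrying a variable (stand-ins for the survey polygons)
small_sf <- st_sf(value = c(1, 3, 10, 20),
                  geometry = st_sfc(sq(0.5, 0.5, 0.5), sq(1.2, 1.2, 0.5),
                                    sq(2.5, 0.5, 0.5), sq(3.2, 1.2, 0.5)))

# Mean of 'value' over all small polygons intersecting each unit;
# the result keeps the geometry of units_sf
agg <- aggregate(small_sf["value"], units_sf, FUN = mean, join = st_intersects)
agg$value  # 2 for U1, 15 for U2
```

A small polygon that straddles two units would be counted in both with this join; whether that is desirable depends on the analysis.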

Related

R Optimisation of the calculation of the geographical distance between a large number of polygons (>11.000)

How can I optimise in R the calculation of the geographical distance between millions of pairs of centroids of polygons?
The polygons represent 111 km x 111 km grid cells covering the entire Earth.
I'm using the st_distance function from sf, but the large number of polygons (>11,000) poses a computational challenge. Any suggestions on how to optimise it? The result does not need to be highly accurate.
Toy code:
# Create an sf object with five polygons
polygons <- st_as_sfc(list(
st_polygon(list(cbind(c(0, 0, 1, 1, 0), c(0, 1, 1, 0, 0)))),
st_polygon(list(cbind(c(1, 1, 2, 2, 1), c(0, 1, 1, 0, 0)))),
st_polygon(list(cbind(c(2, 2, 3, 3, 2), c(0, 1, 1, 0, 0)))),
st_polygon(list(cbind(c(0, 0, -1, -1, 0), c(0, -1, -1, 0, 0)))),
st_polygon(list(cbind(c(-1, -1, -2, -2, -1), c(0, -1, -1, 0, 0))))
))
st_crs(polygons)=4326
data <- data.frame(ID = 1:5, Name = c("A", "B", "C", "D", "E"))
polygons <- st_sf(polygons, data)
# Get the centroids of the polygons and calculate the distance
centroids <- st_centroid(polygons$polygons)
distance <- st_distance(centroids)
Thanks in advance
Several R packages can compute a distance matrix using various distance functions (e.g. Haversine, Vincenty, geodesic). Here is a comparison of four of them; {geodist} is the fastest in this benchmark. Note that the resulting distances differ between packages.
library("sf")
library("terra")
library("geodist")
library("geosphere")
n = 4000
df = data.frame(x = runif(n, -180, 180), y = runif(n, -90, 90))
pts_sf = st_as_sf(df, coords = c("x", "y"), crs = "epsg:4326")
pts_terra = vect(df, geom = c("x", "y"), crs = "epsg:4326")
t = bench::mark(
iterations = 5, check = FALSE,
sf = st_distance(pts_sf),
terra = as.matrix(terra::distance(pts_terra)),
geodist = geodist(df, measure = "haversine"),
geosphere = distm(df, fun = distHaversine)
)
t[, 1:5]
#> expression min median `itr/sec` mem_alloc
#> 1 sf 20.57s 20.76s 0.0480 125.77MB
#> 2 terra 13.02s 13.11s 0.0764 579.85MB
#> 3 geodist 791.53ms 806.79ms 1.24 244.34MB
#> 4 geosphere 2.68s 2.83s 0.351 2.81GB
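To make concrete what the "haversine" measure computes, here is a hedged base-R sketch of a vectorised haversine distance matrix. The function name haversine_matrix and the mean Earth radius of 6371008.8 m are assumptions chosen for illustration; this is not the code any of the benchmarked packages use, and their results may differ slightly from it.

```r
# Great-circle distance matrix on a sphere (haversine formula).
# lon, lat: numeric vectors in degrees; r: assumed sphere radius in metres.
haversine_matrix <- function(lon, lat, r = 6371008.8) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  dlat <- outer(lat, lat, "-")   # pairwise latitude differences
  dlon <- outer(lon, lon, "-")   # pairwise longitude differences
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a))) # pmin guards against rounding above 1
}

# Two points on the equator, 90 degrees of longitude apart:
# the distance is a quarter of the sphere's circumference
d <- haversine_matrix(lon = c(0, 90), lat = c(0, 0))
d[1, 2]
```

Because everything is done with outer(), memory use grows with the square of the number of points, just as in the packaged implementations.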
Depending on scale and required accuracy, you could st_transform your coordinates to an equidistant / equal-area projection.
Then round the centroid coordinates and convert them to integer. This gives coordinates in whole metres; for finer resolution, convert to decimetres or similar before rounding. The expected performance gain comes from using integers together with dist.
Finally use dist to obtain a distance matrix. Using your example data polygons:
library(sf)
library(dplyr)
df <-
polygons |>
st_transform(3035) |> ## Lambert equal area, picked randomly
rowwise() |>
mutate(coords = polygons |>
st_centroid() |>
st_coordinates(),
x = coords[1], y = coords[2]
) |>
as.data.frame() |>
select(Name, x, y) |>
mutate(across(x:y, ~ round(.x, 0) |> as.integer()))
Set unique rownames to identify the centroids in the distance matrix later on:
rownames(df) <- df$Name
> df
Name x y
A A 3150682 -2248929
B B 3273461 -2261293
C C 3396377 -2272283
D D 3022547 -2334767
E E 2899563 -2319670
Calculate the distances:
df |> select(x:y) |> dist()
A B C D
B 123400.0
C 246802.4 123406.3
D 154229.5 261450.3 379016.0
E 260892.8 378427.8 499068.8 123907.2
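Whichever function is used, the size of the result is worth estimating before scaling up. The arithmetic below is a rough sketch that assumes n = 11,000 centroids (a round figure taken from the question) and 8 bytes per stored distance; the variable names are illustrative.

```r
n <- 11000                  # assumed number of centroids
bytes_per_double <- 8       # R stores numeric values as 8-byte doubles

# A full n x n matrix, as returned by st_distance()
full_gb <- n^2 * bytes_per_double / 1e9

# dist() keeps only the lower triangle: n * (n - 1) / 2 values
pairs <- n * (n - 1) / 2
dist_gb <- pairs * bytes_per_double / 1e9

c(full_matrix_GB = full_gb, dist_object_GB = dist_gb)
```

Under these assumptions the full matrix needs a little under 1 GB and the dist object about half of that, before counting any temporary copies made during the computation.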

Adding significance bars within and between groups in dodged ggplot2 boxplots

I have some data that I would like to 1) plot as grouped boxplots, and 2) add significance bars A) between boxplots within each group and B) between specific boxplots of different groups. My data looks something like this:
library("ggplot2")
library("ggsignif")  # provides geom_signif(), used below
df <- data.frame(enzyme = c(rep("A", 9), rep("B", 9), rep("C", 9)),
substrate = c(rep("1", 3), rep("2", 3), rep("3", 3),
rep("1", 3), rep("4", 3), rep("5", 3),
rep("1", 3), rep("4", 3), rep("5", 3)),
AUC = c(6.64, 6.56, 6.21, 5.96, 6.12, 6.24, 6.02, 6.32, 6.12,
0, 0, 0, 5.99, 6.26, 5.94, 0, 0, 0,
0, 0, 0, 5.99, 6.11, 6.13, 0, 0, 0))
q <- ggplot(df, aes(x = enzyme, y = AUC, color = substrate)) +
geom_boxplot(show.legend = F,
position = position_dodge2(width = 0.75, preserve = "single")) +
geom_point(show.legend = F, size = 2, position = position_dodge2(width = 0.75, preserve = "single"))
plot(q)
I know that I can add significance bars between groups with the following:
q + geom_signif(comparisons = list(c("A", "B"), c("A", "C"), c("B", "C")),
test = "t.test", map_signif_level = T)
However, these comparisons are not meaningful for my data.
Instead, I would like to A) add significance bars between boxplots of the same group. I thought I could follow Simon's suggestion in "How to add significance bar between subgroups of box plot" and add the bars manually by defining p-values, labels, and y coordinates. For my dataset this will be more difficult, because I have three subgroups per group rather than two.
Ultimately, I would also like to B) add significance bars comparing two specific subgroups from different groups.
My question is, is there any easy way to do this using existing functions/packages? If I have to do this manually, can anyone suggest a good strategy? I would appreciate it!
I thought about this for a bit and figured out a lengthy solution. If anyone has a more succinct way of doing this, please let me know!
## significance bars within and between subgroups
# rearrange df, one unique sample per column, rows are replicates
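# the outer loop uses lapply so the result stays a list for do.call();
# as.character() avoids empty groups if substrate is stored as a factor
df.split <- do.call(cbind, lapply(split(df, df$enzyme), function(x) {
  sapply(split(x$AUC, as.character(x$substrate)), identity) }) )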
# keep track of sample names
sample.names <- do.call(c, lapply(split(df, df$enzyme), function(x) {
unique(paste0(x$enzyme, ".", x$substrate)) }) )
colnames(df.split) <- sample.names
# perform statistical tests between every pairwise combination of
# samples/columns in df.split
df.tests <- apply(combn(seq_along(sample.names), 2), 2,
function(x) {
t.test(df.split[ ,x[1]], df.split[ ,x[2]])$p.value })
# keep track of sample pairs
sample.pairs <- apply(combn(seq_along(sample.names), 2), 2,
function(x) {
paste0(colnames(df.split)[x[1]], "X",
colnames(df.split)[x[2]]) })
names(df.tests) <- sample.pairs
# think about how the significance bars will be laid out: because there are
# three subgroups per enzyme, the bars for the three pairwise comparisons on
# the same plot would overlap. This needs to be done in layers
# select tests of interest for each layer
within.tests.1 <- c("A.1XA.2", "A.2XA.3",
"B.1XB.4", "B.4XB.5",
"C.1XC.4", "C.4XC.5")
within.tests.2 <- c("A.1XA.3", "B.1XB.5","C.1XC.5")
between.tests.1 <- c("A.1XB.4", "B.4XC.4")
between.tests.2 <- c("A.1XC.4")
p.values.1 <- df.tests[which(names(df.tests) %in% within.tests.1)]
p.values.2 <- df.tests[which(names(df.tests) %in% within.tests.2)]
p.values.3 <- df.tests[which(names(df.tests) %in% between.tests.1)]
p.values.4 <- df.tests[which(names(df.tests) %in% between.tests.2)]
# convert p-values into easily read labels; NaN p-values (e.g. when both
# samples are all zeros) are set to 1 so that they get an empty label
p.values.1 <- replace(p.values.1, is.na(p.values.1), 1)
p.values.2 <- replace(p.values.2, is.na(p.values.2), 1)
p.values.3 <- replace(p.values.3, is.na(p.values.3), 1)
p.values.4 <- replace(p.values.4, is.na(p.values.4), 1)
labels.1 <- symnum(p.values.1, corr = FALSE, cutpoints = c(0, .001,.01,.05, 1),
symbols = c("***","**","*",""))
labels.2 <- symnum(p.values.2, corr = FALSE, cutpoints = c(0, .001,.01,.05, 1),
symbols = c("***","**","*",""))
labels.3 <- symnum(p.values.3, corr = FALSE, cutpoints = c(0, .001,.01,.05, 1),
symbols = c("***","**","*",""))
labels.4 <- symnum(p.values.4, corr = FALSE, cutpoints = c(0, .001,.01,.05, 1),
symbols = c("***","**","*",""))
# determine coordinates for significance bars
# y values for layer 1 should all be just above the highest data point of all
# samples being compared
y.values.1 <- rep(do.call(max, lapply(unlist(strsplit(names(p.values.1), "X")),
                    function(x) {
                      df.split[, which(colnames(df.split) %in% x)] }) ) + 0.3,
                  times = length(labels.1))
# y values for layer 2 should be higher than those of layer 1
y.values.2 <- y.values.1[c(1, 3, 5)] + 0.4
# y values for layer 3 should all be above the highest data point of all
# samples being compared, and higher than layer 2
y.values.3 <- rep(do.call(max, lapply(unlist(strsplit(names(p.values.3), "X")),
                    function(x) {
                      df.split[, which(colnames(df.split) %in% x)] }) ) + 1.2,
                  times = length(labels.3))
# y values for layer 4 should be higher than those of layer 3
y.values.4 <- y.values.3[1] + 0.5
# for x values, first boxplot is always at x = 1
# since there are three groups per x = 1 and preserve = "single", the width of
# each subgroup boxplot is 0.25
x.min.values.1 <- c(0.75, 1, 1.75, 2, 2.75, 3)
x.max.values.1 <- x.min.values.1 + 0.25
x.min.values.2 <- c(0.75, 1.75, 2.75)
x.max.values.2 <- x.min.values.2 + 0.50
x.min.values.3 <- c(0.75, 2)
x.max.values.3 <- c(2, 3)
x.min.values.4 <- c(0.75)
x.max.values.4 <- c(3)
# finally, plot the significance bars for each layer, one on top of the other
q + geom_signif(y_position = y.values.1,
xmin = x.min.values.1,
xmax = x.max.values.1,
annotations = labels.1,
tip_length = rep(0.02, length(labels.1)),
vjust = 0.5 ) +
geom_signif(y_position = y.values.2,
xmin = x.min.values.2,
xmax = x.max.values.2,
annotations = labels.2,
tip_length = rep(0.04, length(labels.2)),
vjust = 0.5 ) +
geom_signif(y_position = y.values.3,
xmin = x.min.values.3,
xmax = x.max.values.3,
annotations = labels.3,
tip_length = rep(0.04, length(labels.3)),
vjust = 0.5 ) +
geom_signif(y_position = y.values.4,
xmin = x.min.values.4,
xmax = x.max.values.4,
annotations = labels.4,
tip_length = rep(0.06, length(labels.4)),
vjust = 0.5 )
The output looks like this:
[Figure: box plot with significance bars within and between groups]
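The hard-coded x positions in the answer follow a simple pattern, which could be captured in a small helper. This is only a sketch: dodge_x is a hypothetical function, and it assumes, as the answer does, that each of the three dodged boxes occupies an equal share (0.25) of the 0.75 dodge width around the group's integer position. It has not been checked against other widths or group counts.

```r
# Centre of the box in position `slot` (1..n_slots) within group `group` (1, 2, ...)
dodge_x <- function(group, slot, n_slots = 3, width = 0.75) {
  group + (slot - (n_slots + 1) / 2) * width / n_slots
}

dodge_x(1, 1:3)  # 0.75 1.00 1.25, the three boxes of enzyme A

# Left ends of the layer-1 bars: slots 1 and 2 in each of the three groups
x.min.layer1 <- as.vector(outer(1:2, 1:3, function(slot, group) dodge_x(group, slot)))
# Right ends: the neighbouring slot, one box width further right
x.max.layer1 <- x.min.layer1 + 0.75 / 3
```

Here x.min.layer1 reproduces the values 0.75, 1, 1.75, 2, 2.75, 3 that the answer writes out by hand.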

How to create a graph from an adjacency matrix by also specifying nodes coordinates in igraph?

I have the following R code:
library('igraph')
nodes <- c('a','b','c','d')
x <- c(0,1,2,3)
y <- c(0,1,2,3)
from <- c('a','b','c')
to <- c('b','c','d')
NodeList <- data.frame(nodes, x ,y)
EdgeList <- data.frame(from, to)
plot(graph_from_data_frame(vertices = NodeList, d= EdgeList, directed = FALSE))
This produces the graph I want. However, I need to be able to use an adjacency matrix instead of the from and to vectors. The function graph_from_adjacency_matrix does not include a parameter for specifying the coordinates of the nodes. How can I achieve this?
As @user20650 mentioned above, you can specify the coordinates of the vertices later, using vertex_attr():
library('igraph')
adjm <- matrix(
c(0, rep(c(1, 0, 0, 0, 0), times = 3)), 4, , F, list(letters[1:4], letters[1:4])
)
g01 <- graph_from_adjacency_matrix(adjm, 'undirected') -> g02
vertex_attr(g02, name = 'x') <- c(0, 0, 1, 1)
vertex_attr(g02, name = 'y') <- c(0, 1, 1, 0)
par(mfrow = 1:2)
plot(g01)
plot(g02)
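If the coordinates are only needed for plotting, another option is to pass them to plot() through its layout argument, which accepts a two-column matrix of x and y positions. The sketch below assumes a symmetric adjacency matrix for the chain a-b-c-d from the question; adjm_sym and coords are illustrative names.

```r
library(igraph)

nodes <- c("a", "b", "c", "d")

# Symmetric adjacency matrix for the chain a - b - c - d
adjm_sym <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
adjm_sym[cbind(1:3, 2:4)] <- 1
adjm_sym <- adjm_sym + t(adjm_sym)

g <- graph_from_adjacency_matrix(adjm_sym, mode = "undirected")

# One row per vertex, in the same order as V(g): column 1 is x, column 2 is y
coords <- cbind(c(0, 1, 2, 3), c(0, 1, 2, 3))
plot(g, layout = coords)
```

With this approach the graph object itself stays free of coordinate attributes, so a different layout can be tried without modifying it.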

Density distributions in R

An assignment has tasked us with creating a series of variables: normal1, normal2, normal3, chiSquared1 and 2, t, and F. They are defined as follows:
library(tibble)
Normal.Frame <- tibble(normal1 = rnorm(5000, 0, 1),
normal2 = rnorm(5000, 0, 1),
normal3 = rnorm(5000, 0, 1),
chiSquared1 = normal1^2,
chiSquared2 = normal2^2,
F = sum(chiSquared1/chiSquared2),
t = sum(normal3/sqrt(chiSquared1 )))
We then have to make histograms of the distributions for normal1, chiSquared1 and 2, t, and F, which is simple enough for normal1 and the chiSquared variables, but when I try to plot F and t, the plot space is blank.
Our lecturer recommended limiting the range of F to 0-10, and t to -5 to 5. To do this, I use:
HistT <- hist(Normal.Frame$t, xlim = c(-5, 5))
HistF <- hist(Normal.Frame$F, xlim = c(0, 10))
Like I mentioned, this yields blank plots.
Your t and F are defined as sums, so each is a single value repeated in every row. If those values fall outside your xlim range, the histogram will be empty. If you remove the sum() calls, so that the ratios are computed element by element, you should get the desired results.
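A minimal base-R sketch of that fix is shown below. The names F_ratio and t_ratio and the seed are illustrative, and subsetting to the lecturer's ranges before calling hist() is an assumption on my part: with xlim alone, hist() still chooses its bins over the full range of the data, which for heavy-tailed ratios like these may leave only a bar or two inside the window.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 5000

normal1 <- rnorm(n, 0, 1)
normal2 <- rnorm(n, 0, 1)
normal3 <- rnorm(n, 0, 1)
chiSquared1 <- normal1^2
chiSquared2 <- normal2^2

# Element-wise ratios: one value per row, no sum()
F_ratio <- chiSquared1 / chiSquared2
t_ratio <- normal3 / sqrt(chiSquared1)

# Histograms restricted to the suggested ranges
hist(t_ratio[abs(t_ratio) <= 5], breaks = 50, main = "t", xlab = "t")
hist(F_ratio[F_ratio <= 10], breaks = 50, main = "F", xlab = "F")
```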

Dissolving hexmap polygon shape files

I am trying to produce an outline for a hexagonal cartogram by dissolving the inner polygons via the unionSpatialPolygons or aggregate functions. I am getting stray hexagons that do not dissolve. Here is a dummy example that shows the problem:
# grab a dummy example shape file
library(raster)
g <- getData(name = "GADM", country = "GBR", level = 2)
# par(mar = rep(0,4))
# plot(g)
# create a hexagonal cartogram
# library(devtools)
# install_github("sassalley/hexmapr")
library(hexmapr)
h <- calculate_cell_size(shape = g, seed = 1,
shape_details = get_shape_details(g),
learning_rate = 0.03, grid_type = 'hexagonal')
i <- assign_polygons(shape = g, new_polygons = h)
par(mar = rep(0,4))
plot(i)
# dissolve the polygons to get coastline
library(maptools)
j <- unionSpatialPolygons(SpP = i, IDs = rep(1, length(i)))
par(mar = rep(0,4))
plot(j)
# same result with aggregate in the raster package
k <- aggregate(x = i)
par(mar = rep(0,4))
plot(k)
With the shapefile I am actually using (not for the UK) I get even more stray hexagons - some complete - some not.
Suggested solution from Roger Bivand (via an email exchange):
g1 <- spTransform(x = g, CRSobj = CRS("+init=epsg:27700"))
# cellsize from calculate_cell_size() above
h1 <- spsample(x = g1, type="hexagonal", cellsize=38309)
i2 <- HexPoints2SpatialPolygons(hex = h1)
j2 <- unionSpatialPolygons(SpP = i2, IDs = rep(1, length(i2)))
plot(j2)
In other words, avoid assign_polygons() in hexmapr and instead use 1) spsample to generate the hexagon positions and 2) HexPoints2SpatialPolygons to build the hexagonal grid (both are in the sp package).
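The same idea (generate a regular hexagonal grid, then dissolve it) can also be sketched with the sf package. This is an alternative under stated assumptions, not Roger Bivand's suggestion: the rectangle region stands in for the projected country outline, the cell size of 1 is arbitrary, and whether the dissolved result is free of slivers for real data is not guaranteed.

```r
library(sf)

# Stand-in for the projected study area: a 10 x 6 rectangle
region <- st_sfc(st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 6),
                                       c(0, 6), c(0, 0)))))

# Regular hexagonal grid over the region (square = FALSE gives hexagons)
hexes <- st_make_grid(region, cellsize = 1, square = FALSE)

# Keep only the hexagons that touch the region, then dissolve them into one outline
keep <- lengths(st_intersects(hexes, region)) > 0
outline <- st_union(hexes[keep])

plot(hexes[keep], border = "grey")
plot(outline, border = "red", lwd = 2, add = TRUE)
```

Because all hexagons come from one regular grid, neighbouring cells are generated from the same lattice, which is the property the suggested solution relies on as well.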

Resources