I am trying to find the group of points within a 300 meter radius that gathers the highest total amount, and I am looking for the coordinates of the centre of that circle. Note that the centre of the area gathering the highest amount need not be one of the points in the observations data frame.
I have the following data:
observations <- spatialrisk::insurance %>%
dplyr::select(amount, lon, lat)
The function spatialrisk::concentration determines the concentration for all target points (i.e. sub):
spatialrisk::concentration(sub = observations,
                           full = observations,
                           value = amount, radius = 300)
The function is written in C++ (Rcpp), and is therefore fast. However, the approach is not 'smart'.
Any ideas for a faster solution with the raster (or velox) package, or with a kernel density approach?
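One idea to get closer to the true optimum, sketched below as an assumption rather than a drop-in solution: evaluate the concentration on a regular grid of candidate centres over the bounding box of the observations, then take (or refine around) the best grid cell. The grid resolution and the name of the returned concentration column are assumptions on my part.
library(dplyr)

observations <- spatialrisk::insurance %>%
  dplyr::select(amount, lon, lat)

# coarse grid of candidate centres over the bounding box of the observations
grid <- expand.grid(
  lon = seq(min(observations$lon), max(observations$lon), length.out = 100),
  lat = seq(min(observations$lat), max(observations$lat), length.out = 100)
)

grid_conc <- spatialrisk::concentration(sub = grid,
                                        full = observations,
                                        value = amount,
                                        radius = 300)

# grid cell whose 300 m circle gathers the highest amount
# (assumes the output column is called 'concentration')
grid_conc[which.max(grid_conc$concentration), ]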
Related
I would like to group polygons together based on a distance criterion:
Any polygon within a certain distance (1200 metres or less) of an origin polygon is grouped with it
If other polygons are within the same distance (1200 metres or less) of these 'neighbouring' polygons, they are added to the same group
The process for this group continues until no further polygons are added (because they are all more than 1200 metres away)
The next ungrouped polygon is then selected and the process repeats for a new grouping
Polygons with no neighbour within 1200 metres are assigned to a group by themselves
A polygon should belong to only one group
The final output would be a table with each polygon's ID (UID), the group ID it belongs to (GrpID), and the average distance between the polygons in that group
I am sure a distance matrix with st_distance means this is possible, but I'm just not getting it.
library(sf)
library(dplyr)
download.file("https://drive.google.com/uc?export=download&id=1-I4F2NYvFWkNqy7ASFNxnyrwr_wT0lGF" , destfile="ProximityAreas.zip")
unzip("ProximityAreas.zip")
Proximity_Areas <- st_read("Proximity_Areas.gpkg")
Dist_Matrix <- Proximity_Areas %>%
  st_distance(by_element = FALSE)
This function uses sf and igraph package functions:
group_polygons <- function(polys, distance){
  ## pairwise distance matrix between all polygons
  dist_matrix <- st_distance(polys, by_element = FALSE)
  ## the result carries units, so strip them before comparing to a bare number
  class(dist_matrix) <- NULL
  ## binary adjacency matrix: 1 if two polygons are within the distance threshold
  connected <- 1 * (dist_matrix < distance)
  ## build an undirected graph from the adjacency matrix
  g <- igraph::graph_from_adjacency_matrix(connected, mode = "undirected")
  ## connected components of the graph are the polygon groups
  igraph::components(g)$membership
}
You can use it like this:
Proximity_Areas$Group = group_polygons(Proximity_Areas, 1200)
Let's make a category for mapping:
Proximity_Areas$FGroup = factor(Proximity_Areas$Group)
plot(Proximity_Areas[,"FGroup"])
There are three clusters here, the big one, one with 3 regions on the right, and one singleton region on the left. All the orange regions could be connected together by bridges that are less than 1200m long.
If you want to compute the average distance without re-computing the distance matrix, you can do this within the function by subsetting according to the membership value from the components function. The key here is computing the binary 0/1 matrix and using igraph to compute the connectivity of that as an adjacency matrix.
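A minimal sketch of that idea (my own extension of the function above, not from the original answer; the UID/GrpID/GrpAvgDist output columns follow the question's wording, with UID taken to be just the row index):
group_polygons_with_dist <- function(polys, distance){
  dist_matrix <- st_distance(polys, by_element = FALSE)
  class(dist_matrix) <- NULL
  connected <- 1 * (dist_matrix < distance)
  g <- igraph::graph_from_adjacency_matrix(connected, mode = "undirected")
  membership <- igraph::components(g)$membership
  ## average pairwise distance within each group (0 for singleton groups)
  avg_dist <- sapply(sort(unique(membership)), function(grp){
    idx <- which(membership == grp)
    if (length(idx) < 2) return(0)
    sub <- dist_matrix[idx, idx]
    mean(sub[upper.tri(sub)])
  })
  data.frame(UID        = seq_len(nrow(polys)),
             GrpID      = membership,
             GrpAvgDist = avg_dist[membership])
}

Groups <- group_polygons_with_dist(Proximity_Areas, 1200)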
I have a series of daily values, y. For each day, d_i (i.e., each row), I would like to calculate the (graph) area, a_i, of the region between the curve and the horizontal line y = y_i, from d_i back to the most recent previous occurrence of the value y_i. Sketch below. Because observations occur at regular, discrete timesteps (daily), the calculated area a_i is equivalent to the sum of the daily differences between each daily y and y_i (black bars in the figure). I'm interested only in valleys, so the calculated area a_i can be set to 0 when y is decreasing (y_i - y_{i-1} <= 0).
Toy data below. Expected result shown in dat$a.
dat$a[6] was calculated from 55 - 50;
dat$a[7] was calculated from (60-55)+(60-50). And so on.
library(lubridate)
dat = data.frame(d = seq.Date(as_date("2021-01-01"), as_date("2021-01-10"), by = "1 day"),
                 y = c(100, 95, 90, 70, 50, 55, 60, 75, 85, 90),
                 a = c(0, 0, 0, 0, 0, 5, 15, 65, 115, 145))
My first thought was to calculate the area between the curve and the horizontal line y = y_i between day d_i and the most recent previous occurrence of the value y_i, perhaps using geiger::area.between.curves(), but I couldn't work out how to identify the most recent previous occurrence of the value y_i.
[In case the context helps, the actual data are daily values of the area (m2) of a wetland not submerged by water. When the water rises, a portion of the wetland that had been dry for some time becomes wet. Here, I'm trying to calculate the extent of the reflooding in m2-days. A portion of the wetland that has been dry for a long time but becomes reflooded will contribute many m2-days to the sum.]
I'm most comfortable in the tidyverse, and such answers are greatly preferred. I am not familiar with data.table.
Thanks in advance
Update
I was able to achieve my desired calculation in Excel, though it's brutally inelegant. There are a couple hundred rows in the example, linked below. Given that my real data are 180k rows, my poor machine hated the 18 million calculated cells. Though I can move on with my analysis, I am still very interested in an R solution. My implemented approach differs subtly from my imagined R approach in that it sums 'horizontal rectangles', so to speak, each of the same (small) y-unit height, rather than 'vertical rectangles', each of unit width.
Here's the file.
Since the question is missing complete information, we will compute the area under the curve assuming that a day is one unit. Modify as appropriate for your specific problem.
library(pracma)

nr <- nrow(dat)
## duplicate the first and last rows and drop those copies to y = 0,
## so the points trace a closed polygon between the curve and the x-axis
dat0 <- dat[c(1, 1:nr, nr), ]
dat0[c(1, nr + 2), "y"] <- 0
## polygon area; consecutive days are one unit apart, so this is the area under the curve
with(dat0, abs(polyarea(as.numeric(d), y)))
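For the per-day 'valley' areas the question actually asks for (reproducing the expected dat$a from the toy data), here is a rough tidyverse-style sketch of my own, not part of the answer above: for each day, look back to the most recent day whose value was at least y_i and sum the daily differences since then.
library(dplyr)
library(purrr)

valley_area <- function(y){
  map_dbl(seq_along(y), function(i){
    ## area is 0 on the first day and whenever y is not rising
    if (i == 1 || y[i] <= y[i - 1]) return(0)
    ## most recent previous day at or above today's value (0 if there is none)
    prev <- which(y[seq_len(i - 1)] >= y[i])
    j <- if (length(prev) > 0) max(prev) else 0
    ## sum of daily differences between that day (exclusive) and today
    sum(y[i] - y[(j + 1):(i - 1)])
  })
}

dat %>% mutate(a_check = valley_area(y))   # a_check should reproduce dat$a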
I have a made-up dataset of polling stations in Wales, and I've attached a date column to it. We can imagine this date is the date the polling station was visited to check the facilities (for example).
What I'd like to do is work out :
I would like to work out whether geographic points are within a certain distance
This I've managed by self-joining and using st_buffer and st_within to calculate which stations are within 1000 m, and then counting the number of neighbours.
and also the interval between the sample dates
this I'm having a bit of a problem with
What I'd like to do, I think, is
for each polling station
calculate the number of neighbours (so far so easy)
for each neighbour determine the interval between the sampling dates
return a spatial object (for plotting in tmaps probably)
Here's some test code that I've got that generates the sf dataset, calculates the number of neighbours and returns that.
It's really the date interval that's stumping me. It's not so much the calculation of the date interval but it's the way to generate these clusters of polling stations with date intervals.
Is it better to generate the (in this case) 108 polling station clusters?
What I'm trying to do in my larger dataset is calculate clusters of points over time.
I have ~2000 records with a date. I'd like to say :
for each of these 2000 records calculate the number of neighbours within a distance and within a timeframe.
I think it's probably better to
calculate each cluster of neighbouring points and visualise
then
remove neighbours from the cluster that are outside of the time frame and visualise that
Although, on typing this, I wonder if excluding points that didn't fall within a timeframe first and then calculating neighbours would be more efficient?
library(sf)
library(dplyr)

polls <- st_as_sf(read.csv(url("https://www.caerphilly.gov.uk/CaerphillyDocs/FOI/Datasets_polling_stations_csv.aspx")),
                  coords = c("Easting", "Northing"), crs = 27700) %>%
  mutate(date = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/31'), by = "day"), 147))

test_stack <- polls %>%
  st_join(polls %>% st_buffer(dist = 1000), join = st_within) %>%
  filter(Ballot.Box.Polling.Station.x != Ballot.Box.Polling.Station.y) %>%
  add_count(Ballot.Box.Polling.Station.x) %>%
  rename(number_of_neighbours = n) %>%
  mutate(interval_date = date.x - date.y) %>%
  subset(select = -c(6:8, 10, 11, 13:18))
## uncomment to summarise the data so that only the number of neighbours is returned:
# test_stack <- test_stack %>%
#   distinct(Ballot.Box.Polling.Station.x, number_of_neighbours, date.x) %>%
#   filter(number_of_neighbours >= 2)
I think it might be as simple as:
library(tmap)
tm_shape(test_stack) + tm_dots(col = "number_of_neighbours", clustering = TRUE, size = 0.5)
I'm not sure how clustering works in leaflet, but that works quite nicely on this test data.
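One rough way to bring the date interval into the neighbour count, sketched here under my own assumption of a 30-day window (swap in whatever timeframe you need): filter the self-join on both the 1000 m buffer and the absolute difference between the two dates before counting.
library(sf)
library(dplyr)

neighbours_in_time <- polls %>%
  st_join(polls %>% st_buffer(dist = 1000), join = st_within) %>%
  filter(Ballot.Box.Polling.Station.x != Ballot.Box.Polling.Station.y,
         abs(as.numeric(date.x - date.y)) <= 30) %>%   # neighbours sampled within 30 days
  add_count(Ballot.Box.Polling.Station.x, name = "n_neighbours_30d") %>%
  distinct(Ballot.Box.Polling.Station.x, n_neighbours_30d, date.x)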
I want to assess the degree of spatial proximity of each point to other equivalent points by looking at the number of others within 400m (5 minute walk).
I have some points on a map.
I can draw a simple 400 m buffer around them.
I want to determine which buffers overlap and then count the number of overlaps.
This number of overlaps should relate back to the original point, so I can see which point has the highest number of overlaps and, therefore, how many other points I could get to if I were to walk 400 m from that point.
I've asked this question on GIS Stack Exchange, but I'm not sure it's going to get answered for ArcGIS, and I think I'd prefer to do the work in R.
This is what I'm aiming for
https://www.newham.gov.uk/Documents/Environment%20and%20planning/EB01.%20Evidence%20Base%20-%20Cumulative%20Impact%20V2.pdf
To simplify, here's some code:
# load packages
library(easypackages)
needed<-c("sf","raster","dplyr","spData","rgdal",
"tmap","leaflet","mapview","tmaptools","wesanderson","DataExplorer","readxl",
"sp" ,"rgisws","viridis","ggthemes","scales","tidyverse","lubridate","phecharts","stringr")
easypackages::libraries(needed)
## read in csv data; first column is assumed to be Easting and second Northing
polls<-st_as_sf(read.csv(url("https://www.caerphilly.gov.uk/CaerphillyDocs/FOI/Datasets_polling_stations_csv.aspx")),
coords = c("Easting","Northing"),crs = 27700)
polls_buffer_400 <- st_buffer(polls, 400)
polls_intersection<-st_intersection(x=polls_buffer_400,y=polls_buffer_400)
plot(polls_intersection$geometry)
That should show the overlapping buffers around the polling stations.
What I'd like to do is count the number of overlaps which is done here:
polls_intersection_grouped<-polls_intersection%>%group_by(Ballot.Box.Polling.Station)%>%count()
And this is the bit I'm not sure about: to get to the output I want (which will show "hotspots" of polling stations in this case), how do I colour things? How can I:
assess the degree of spatial proximity of each point to other equivalent points by looking at the number of others within 400m (5 minute walk)?
It's probably terribly bad form but here's my original GIS question
https://gis.stackexchange.com/questions/328577/buffer-analysis-of-points-counting-intersects-of-resulting-polygons
Edit:
This gives the intersections different colours, which is great:
plot(polls_intersection$geometry,col = sf.colors(categorical = TRUE, alpha = .5))
summary(lengths(st_intersects(polls_intersection)))
What am I colouring here? I mean it looks nice but I really don't know what I'm doing.
How can I: assess the degree of spatial proximity of each point to other equivalent points by looking at the number of others within 400m (5 minute walk)?
Here is how to add a column to your initial sfc of polling stations that tells you how many polling stations are within 400m of each feature in that sfc.
Note that the minimum value is 1 because a polling station is always within 400m of itself.
# n_neighbors shows how many polling stations are within 400m
polls %>%
mutate(n_neighbors = lengths(st_is_within_distance(polls, dist = 400)))
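If you would rather not count a station as its own neighbour (a small tweak of my own, not part of the answer above), subtract one from that length:
polls %>%
  mutate(n_other_neighbors = lengths(st_is_within_distance(polls, dist = 400)) - 1)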
Similarly, for your sfc collection of intersecting polygons, you could add a column that counts the number of buffer polygons that contain each intersection polygon:
polls_intersection %>%
mutate(n_overlaps = lengths(st_within(geometry, polls_buffer_400)))
And this is the bit I'm not sure about, to get to the output I want (which will show "Hotspots" of polling stations in this case) how do I colour things?
If you want to plot these things I highly recommend using ggplot2. It makes it very clear how you associate an attribute like colour with a specific variable.
For example, this maps the alpha (transparency) of each polygon to a scaled version of the n_overlaps column:
library(ggplot2)
polls_intersection %>%
mutate(n_overlaps = lengths(st_covered_by(geometry, polls_buffer_400))) %>%
ggplot() +
geom_sf(aes(alpha = 0.2*n_overlaps), fill = "red")
Lastly, there should be a better way to generate your intersecting polygons that already counts overlaps. This is built in to the st_intersection function for finding intersections of sfc objects with themselves.
However, your data in particular generates an error when you try to do this:
st_intersection(polls_buffer_400)
#> Error in CPL_nary_intersection(x) :
#>   Evaluation error: TopologyException: side location conflict at 315321.69159061194 199694.6971799387.
I don't know what a "side location conflict" is. Maybe @edzer could help with that. However, most subsets of your data do not contain that conflict. For example:
# this version adds an n.overlaps column automatically:
st_intersection(polls_buffer_400[1:10,]) %>%
ggplot() + geom_sf(aes(alpha = 0.2*n.overlaps), fill = "red")
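A possible workaround, offered only as an untested assumption on my part: topology problems like this can sometimes be avoided by snapping the coordinates to a coarser precision and repairing the buffers with st_make_valid() (available in recent sf versions; older versions had it in lwgeom) before intersecting.
polls_buffer_400 %>%
  st_set_precision(1e5) %>%   # snap coordinates to a fixed precision grid (untested assumption)
  st_make_valid() %>%         # repair any invalid buffer geometries
  st_intersection() %>%       # should add the n.overlaps column as above
  ggplot() + geom_sf(aes(alpha = 0.2*n.overlaps), fill = "red")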
The figure is a plot of an x, y data set from an Excel file, 8760 pairs of x and y in total. I want to remove the noisy data pairs in the red-circled area and write a new Excel file with the remaining pairs. How can I do this in R?
Using @G5W's example:
Make up data:
set.seed(2017)
x = runif(8760, 0,16)
y = c(abs(rnorm(8000, 0, 1)), runif(760,0,8))
XY = data.frame(x,y)
Fit a quantile regression to the 90th percentile:
library(quantreg)
library(splines)
qq <- rq(y~ns(x,20),tau=0.9,data=XY)
Compute and draw the predicted curve:
xvec <- seq(0,16,length.out=101)
pp <- predict(qq,newdata=data.frame(x=xvec))
plot(y~x,data=XY)
lines(xvec,pp,col=2,lwd=2)
Keep only points below the predicted line:
XY2 <- subset(XY,y<predict(qq,newdata=data.frame(x)))
plot(y~x,data=XY2)
lines(xvec,pp,col=2,lwd=2)
You can make the line less wiggly by lowering the number of knots, e.g. y~ns(x,10)
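For example (same data and tau as above, only the spline basis changed):
qq10 <- rq(y ~ ns(x, 10), tau = 0.9, data = XY)   # fewer knots -> smoother quantile curve
lines(xvec, predict(qq10, newdata = data.frame(x = xvec)), col = 4, lwd = 2)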
Both R and EXCEL read and write .csv files, so you can use those to transfer the data back and forth.
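A minimal round trip might look like this (the file names are placeholders of my own):
# read the pairs exported from Excel as CSV
XY <- read.csv("pairs.csv")
# ... filter as above to get XY2 ...
# write the cleaned pairs back out for Excel
write.csv(XY2, "pairs_cleaned.csv", row.names = FALSE)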
You do not provide any data so I made some junk data to produce a similar problem.
DATA
set.seed(2017)
x = runif(8760, 0,16)
y = c(abs(rnorm(8000, 0, 1)), runif(760,0,8))
XY = data.frame(x,y)
One way to identify noise points is by looking at the distance to the nearest neighbors. In dense areas, nearest neighbors will be closer; in non-dense areas, they will be further apart. The dbscan package provides a nice function to get the distance to the k nearest neighbors. For this problem, I used k = 6, but you may need to tune that for your data. Looking at the distribution of distances to the 6th nearest neighbor, we see that most points have 6 neighbors within a distance of 0.2.
library(dbscan)
XY6 = kNNdist(XY, 6, all = TRUE)   # all = TRUE returns a matrix of the 1st..6th neighbour distances
plot(density(XY6[,6]))             # distribution of the distance to the 6th nearest neighbour
So I will assume that points whose 6th nearest neighbor is further away than 0.2 are noise points. Just changing the color to see which points are affected, we get
TYPE = rep(1,8760)
TYPE[XY6[,6] > 0.2] = 2
plot(XY, col=TYPE)
Of course, if you wish to restrict to the non-noise points, you can use
NonNoise = XY[XY6[,6] <= 0.2,]   # keep only the points whose 6th nearest neighbour is within 0.2