I am struggling a bit with two datasets containing coordinates of individuals and cell towers:
A first dataset on 9,459 individuals with 1,214 variables including their latitude and longitude in degrees.
A second dataset on 31,176 cell towers with 4 variables including their latitude and longitude in degrees, and range in meters.
I would like to determine whether an individual is in the range of at least one of the cell towers and create a dummy equal to 1 if it is the case.
However, due to the size of the datasets, I cannot merge them with the cross-join command. I tried using the geosphere package with the following command:
distm(c(df1$longitude, df2$latitude), c(df2$longitude, df2$latitude), fun= distHaversine)
Unfortunately, it does not work, since the two datasets are not the same size. Any idea how to solve this issue?
Generally, this can be done much more efficiently by making better use of RAM and the processor and by reducing overhead. However, if what you are trying to do is a one-time operation, the approach below should be enough (it takes around 5 minutes on a current notebook).
Helper function
# More info: https://github.com/RomanAbashin/distGeo_v
distGeo_v <- function(x, y, xx, yy) {
  if (!"geosphere" %in% installed.packages()) {
    stop("The 'geosphere' package needs to be installed for this function to work.")
  }
  matrix(.Call("_inversegeodesic",
               as.double(x), as.double(y), as.double(xx), as.double(yy),
               as.double(6378137), 1/298.257223563, PACKAGE = 'geosphere'),
         ncol = 3, byrow = TRUE)[, 1]
}
Data
library(geosphere)
library(tidyverse)
set.seed(1702)
users <- tibble(userid = 1:10000,
                x = rnorm(10000, 16.3738, 5),
                y = rnorm(10000, 48.2082, 5))
towers <- tibble(lon = rnorm(35000, 16.3738, 10),
                 lat = rnorm(35000, 48.2082, 10),
                 range = runif(35000, 50, 10000))
Code
result <- NULL
for (i in 1:nrow(users)) {
  is_match <- users[i, 1:3] %>%
    tidyr::crossing(towers[, 1:3]) %>%
    filter(distGeo_v(x, y, lon, lat) <= range) %>%
    nrow() > 0
  result <- bind_rows(result, tibble(userid = users$userid[i],
                                     match = is_match))
}
Result
> head(result)
# A tibble: 6 x 2
userid match
<int> <lgl>
1 1 TRUE
2 2 FALSE
3 3 FALSE
4 4 TRUE
5 5 FALSE
6 6 FALSE
Now you can left_join the result to your original data.
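For example, with the simulated users and result tibble from above, also adding the 0/1 dummy asked for in the question:
# attach the in-range flag to the user data and recode it as a 0/1 dummy
users <- users %>%
  left_join(result, by = "userid") %>%
  mutate(in_tower_range = as.integer(match))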
Below I add a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp), and are therefore very fast.
The function spatialrisk::points_in_circle() returns the observations within a given radius of a center point. Note that distances are calculated using the Haversine formula. Since each element of the output is a data frame, purrr::map2_dfr is used to row-bind them together:
library(tibble)
library(spatialrisk)
library(dplyr)
set.seed(1702)
users <- tibble(userid = as.character(1:10000),
                lon = rnorm(10000, 16.3738, 1),
                lat = rnorm(10000, 48.2082, 1))
towers <- tibble(lon = rnorm(35000, 16.3738, 1),
                 lat = rnorm(35000, 48.2082, 1))
# Users with tower within 200 meters
purrr::map2_dfr(users$lon, users$lat,
                ~points_in_circle(towers, .x, .y, radius = 200)[1, ],
                .id = "userid") %>%
  mutate(inrange = ifelse(is.na(distance_m), FALSE, TRUE))
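If you assign the output of that pipeline to an object, say res (a name introduced here for illustration), you can join the flag back onto users to build the 0/1 dummy from the original question:
# hypothetical follow-up: 'res' is assumed to hold the output of the pipeline above
users_flagged <- users %>%
  left_join(select(res, userid, inrange), by = "userid") %>%
  mutate(in_tower_range = as.integer(inrange))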
Related
I have the following data frame:
library(dplyr)
set.seed(42)
df <- data_frame(x = sample(seq(0, 1, 0.1), 5, replace = T), y = sample(seq(0, 1, 0.1), 5, replace = T), z= sample(seq(0, 1, 0.1), 5, replace = T) )
For each row in df, I would like to find out whether there is a row in df2 which is close to it ("neighbor") in all columns, where "close" means that it is not different by more than 0.1 in each column.
So for instance, a proper neighbor to the row (1, 0.5, 0.5) would be (0.9, 0.6, 0.4).
The second data set is
set.seed(42)
df2 <- data_frame(x = sample(seq(0, 1, 0.1), 10, replace = T), y = sample(seq(0, 1, 0.1), 10, replace = T), z= sample(seq(0, 1, 0.1), 10, replace = T) )
In this case there is no "neighbor", so I'm supposed to get FALSE for all rows of df.
My actual data frames are much bigger than this (dozens of columns and hundreds of thousands of rows), so the naming has to be very general rather than "x", "y" and "z".
I have a sense that this can be done using mutate and funs; for example, I tried this line:
df <- df %>% mutate_all(funs(close = (. <= df2(, .)+0.1) & (. >= df2(, .)-0.1))
But I got an error.
Any ideas?
You can use the fuzzyjoin package:
library(fuzzyjoin)
# adding two rows that match
df2 <- rbind(df2, df[1:2, ] + 0.01)
df %>%
  fuzzy_left_join(df2, match_fun = function(x, y) y < x + 0.1 & y > x - 0.1) %>%
  mutate(found = !is.na(x.y)) %>%
  select(-4:-6)
# # A tibble: 5 x 4
# x.x y.x z.x found
# <dbl> <dbl> <dbl> <lgl>
# 1 1 0.5 0.5 TRUE
# 2 1 0.8 0.7 TRUE
# 3 0.3 0.1 1 FALSE
# 4 0.9 0.7 0.2 FALSE
# 5 0.7 0.7 0.5 FALSE
Find more info there: Joining/matching data frames in R
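A side note (my addition, not part of the original answer): fuzzyjoin also has difference_left_join(), which expresses the per-column tolerance directly; note that it uses <= rather than the strict inequalities above:
# assumes df and df2 as defined above; keeps rows of df, flags those where every
# common column of some df2 row differs by at most 0.1
df %>%
  difference_left_join(df2, by = c("x", "y", "z"), max_dist = 0.1) %>%
  mutate(found = !is.na(x.y)) %>%
  select(x.x, y.x, z.x, found)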
The machine learning approach to finding a close entry in a multi-dimensional dataset is Euclidean distance.
The general approach is to normalize all the attributes: make the range of each column the same, zero to one or negative one to one, so that columns with large and small values have an equal effect on the distance. More advanced approaches also center the adjusted column values on zero. The test criterion is scaled the same way.
The next step is to calculate the distance of each observation from its neighbors. If the data set is small or computing time is cheap, calculate the distance from every observation to every other. The Euclidean distance from observation 1 (row 1) to observation 2 (row 2) is sqrt((X1 - X2)^2 + (Y1 - Y2)^2 + ...). Choose your criterion and select.
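A minimal sketch of that general recipe (my addition; dd stands for any all-numeric data frame with no constant columns, a name introduced for illustration):
# rescale every column to [0, 1] so large-valued columns don't dominate the distance
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
dd_scaled <- as.data.frame(lapply(dd, rescale01))
# Euclidean distance between rows 1 and 2
sqrt(sum((as.numeric(dd_scaled[1, ]) - as.numeric(dd_scaled[2, ]))^2))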
In your case, the selection criterion is simpler: two observations are close if no attribute differs by more than 0.1. I assume that df and df2 have the same number of columns, in the same order, and that close observations are relatively rare. My approach is to stop investigating a pair as soon as we discover it is distant. If you have hundreds of thousands of rows, you will likely exhaust memory if you try to calculate all the combinations at the same time.
~~~~~
You have a big problem. If your data sets df and df2 are one hundred thousand rows each, with four dozen columns, the machine needs to do 4.8e+11 comparisons. The scorecard at the end will have 1e+10 results (close or distant). I started with some subsetting to do the comparisons, with tearful results: R wanted matrices of the same size, and the kluge I devised was unsuccessful. Therefore I regressed to the days of FORTRAN and did it with loops. With the loop approach, you could subset the problem and finish without smoking your machine.
From the sample data, I did the comparisons by hand, all 150 of them: nrow(df) * nrow(df2) * ncol(df). There were no close observations in the sample data by the definition you gave.
Here is how I intended to present the results before transferring the results to a new column in df.
dfclose <- matrix(TRUE, nrow = nrow(df), ncol = nrow(df2))
dfclose # Have a look
This matrix records whether each observation in df (rows of dfclose) is close to each observation in df2 (columns of dfclose). If close, the entry is TRUE.
Here is the repository of the result of the distance measures:
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
dfdist # have a look; it's the same format, but with numbers
We start with the assumption that every observation in df is close to every observation in df2: the total score is zero. For each column we then add 1 to the score whenever the per-column (Manhattan) distance exceeds 0.1. As soon as a pair's score is greater than zero it is no longer close and, in principle, we needn't evaluate any more columns.
closeCriterion <- function(origin, dest) {
  manhattanDistance <- abs(origin - dest)
  # print(paste("manhattanDistance =", manhattanDistance))
  if (manhattanDistance < .1) ret <- 0 else ret <- 1
  ret
}
convertScore <- function(x) if (x > 0) FALSE else TRUE
for (j in 1:ncol(df)) {
  print(paste("col =", j))
  for (i in 1:nrow(df)) {
    print(paste("df row =", i))
    for (k in 1:nrow(df2)) {
      # print(paste("df2 row (and dfclose column) =", k))
      distantScore <- closeCriterion(df[i, j], df2[k, j])
      # print(paste("df and dfdist row =", i, " df2 row (and dfclose column) =", k, " distantScore = ", distantScore))
      dfdist[i, k] <- dfdist[i, k] + distantScore
    }
  }
}
dfdist # have a look at the numerical results
dfclose <- matrix(sapply(dfdist, convertScore), ncol = nrow(df2))  # logical matrix of close/distant results
I wanted to see what the process would look like at scale.
set.seed(42)
df <- matrix(rnorm(3000), ncol = 30)
set.seed(42)
df2 <-matrix(rnorm(5580), ncol = 30)
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
Then I ran the code block to see what would happen.
~ ~ ~
You might reconsider the problem definition. I ran the model several times, changing the criterion for closeness. If the entry in each of three dozen columns in df2 has a 90% chance of matching its correspondent in df, the row as a whole has only about a 2.2% chance of matching. The example data is not such a good test case for the algorithm.
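A quick sanity check of that 2.2% figure in R:
0.9^36   # ~0.0225, i.e. about a 2.2% chance that all 36 columns match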
Best of luck
Here's one way to calculate that column without fuzzyjoin
library(tidyverse)
found <-
  expand.grid(row.df = seq(nrow(df)),
              row.df2 = seq(nrow(df2))) %>%
  mutate(in.range = pmap_lgl(., ~ all(abs(df[.x, ] - df2[.y, ]) <= 0.1))) %>%
  group_by(row.df) %>%
  summarise_at('in.range', any) %>%
  select(in.range)
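To attach that flag back to df (a small usage sketch, assuming the found tibble computed above):
# found has one row per row of df, in the same order
df_flagged <- bind_cols(df, found)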
Currently I have two data.frames, one of polygons (poly.x, poly.y, enum) and one of points (pt.x, pt.y) where enum is the id of the polygon. I am trying to determine which points belong to which polygons so I get a data.frame of (pt.x, pt.y, enum).
My first attempt uses point.in.polygon from the sp package and lapply functions to find which polygon(s) the point belongs to. While my code works, it takes a long time on large data sets.
My second attempt uses over also from the sp package, cobbled together from questions on gis stackexchange. While it is much faster, I cannot seem to get the correct output from over as it is a dataframe of 1s and NAs.
Below I've included a simplified working example (npoly can be changed to test the speed of different methods) as well as my working attempt using sp::point.in.polygon and nonsensical output from my sp::over attempt. I'm not fussed which method I end up using as long as it's fast.
Any help would be much appreciated!
#-------------------------------------------
# Libraries
library(ggplot2) # sample plots
library(dplyr) # bind_rows(), etc
library(sp) # spatial data
# Sample data
npoly = 100
# polygons
localpolydf <- data.frame(
  x = rep(c(0, 1, 1, 0), npoly) + rep(0:(npoly-1), each = 4),
  y = rep(c(0, 0, 1, 1), npoly),
  enum = rep(1:npoly, each = 4))
# points
offsetdf <- data.frame(
  x = seq(min(localpolydf$x) - 0.5, max(localpolydf$x) + 0.5, by = 0.5),
  y = runif(npoly*2 + 3, 0, 1))
# Sample plot
ggplot() +
  geom_polygon(aes(x, y, group = enum),
               localpolydf, fill = NA, colour = "black") +
  geom_point(aes(x, y), offsetdf)
#-------------------------------------------
# Dplyr and lapply solution for point.in.polygon
ptm <- proc.time() # Start timer
# create lists
offsetlist <- split(offsetdf, rownames(offsetdf))
polygonlist <- split(localpolydf, localpolydf$enum)
# lapply over each pt in offsetlist
pts <- lapply(offsetlist, function(pt) {
  # lapply over each polygon in polygonlist
  ptpoly <- lapply(polygonlist, function(poly) {
    data.frame(
      enum = poly$enum[1],
      ptin = point.in.polygon(pt[1, 1], pt[1, 2], poly$x, poly$y))
  })
  ptpoly <- bind_rows(ptpoly) %>% filter(ptin != 0)
  if (nrow(ptpoly) == 0) return(data.frame(x = pt$x, y = pt$y, enum = NA, ptin = NA))
  ptpoly$x = pt$x
  ptpoly$y = pt$y
  return(ptpoly[c("x", "y", "enum", "ptin")])
})
pts_apply <- bind_rows(pts)
proc.time() - ptm # end timer
#-------------------------------------------
# Attempted sp solution for over
ptm <- proc.time() # Start timer
# Split the dataframe into a list based on enum and then remove enum from df in the list
polygonlist <- split(localpolydf, localpolydf$enum)
polygonlist <- lapply(polygonlist, function(x) x[,c("x", "y")])
# Convert the list to Polygon, then create a Polygons object
polygonsp <- sapply(polygonlist, Polygon)
polygonsp <- Polygons(polygonsp, ID = 1)
polygonsp <- SpatialPolygons(list(polygonsp))
plot(polygonsp)
# Convert points to coordinates
offsetps <- offsetdf
coordinates(offsetps) <- ~x+y
points(offsetps$x, offsetps$y)
# Determine polygons points are in
pts_sp <- over(offsetps, polygonsp)
proc.time() - ptm # end timer
#===========================================
# Output
# Apply: point.in.polygon
> head(pts_apply)
x y enum ptin
1 -0.5 0.2218138 NA NA
2 4.0 0.9785541 4 2
3 4.0 0.9785541 5 2
4 49.0 0.3971479 49 2
5 49.0 0.3971479 50 2
6 49.5 0.1177206 50 1
user system elapsed
4.434 0.002 4.435
# SP: over
> head(pts_sp)
1 2 3 4 5 6
NA 1 1 NA 1 NA
user system elapsed
0.048 0.000 0.047
An alternative to using over is to use sf::st_intersection, as the sf package is becoming more and more popular.
Getting the data into sf objects took me a little bit of work, but if you are working with external data you can just read it in with st_read and it will already be in the correct form.
Here is how to approach it:
library(tidyverse)
library(sf)
# convert into st_polygon friendly format (all polygons must be closed)
# must be a nicer way to do this!
localpoly <- localpolydf %>% split(localpolydf$enum) %>%
  lapply(function(x) rbind(x, x[1, ])) %>%
  lapply(function(x) x[, 1:2]) %>%
  lapply(function(x) list(as.matrix(x))) %>%
  lapply(function(x) st_polygon(x))
# convert points into sf object
points <- st_as_sf(offsetdf,coords=c('x','y'),remove = F)
#convert polygons to sf object and add id column
polys <- localpoly %>% st_sfc() %>% st_sf(geom=.) %>%
mutate(id=factor(1:100))
#find intersection
joined <- polys %>% st_intersection(points)
# Sample plot
ggplot() + geom_sf(data=polys) +
geom_sf(data=joined %>% filter(id %in% c(1:10)),aes(col=id)) +
lims(x=c(0,10))
Note that, at the time of writing, to use geom_sf you will need to install the development version of ggplot2.
plot output:
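As a side note (my addition): if you also want to keep the points that fall outside every polygon, sf::st_join keeps all points and attaches the polygon id where there is a match (NA otherwise):
# left join: every point is kept, id is NA for points outside all polygons
joined_all <- st_join(points, polys, join = st_intersects, left = TRUE)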
over returns, for each point, the index of the geometry it falls inside (NA if none). Perhaps something like this:
xy <- offsetps[names(na.omit(pts_sp == 1)), ]
plot(polygonsp, axes = 1, xlim = c(0, 10))
points(offsetps)
points(xy, col = "red")
After having another look, I realised Roman did pts_sp == 1 because I only had one ID for all of my squares, i.e. when I did ID = 1.
Once I fixed that, I was able to add a column with ID = enum. To handle points that fall in multiple polygons, I can use returnList = TRUE and add additional lines to convert the list to a data.frame (a sketch of that variant follows the code below), but it isn't necessary here.
# Attempted sp solution
ptm <- proc.time() # Start timer
# Split the dataframe into a list based on enum and then remove enum from df in the list
polygonlist <- split(localpolydf, localpolydf$enum)
# Convert the list to Polygon, then create a Polygons object
polygonsp <- sapply(polygonlist, function(poly) {
  Polygons(list(Polygon(poly[, c("x", "y")])), ID = poly[1, "enum"])
})
# polygonsp <- Polygons(polygonsp, ID = 1)
polygonsp <- SpatialPolygons(polygonsp)
plot(polygonsp)
# Convert points to coordinates
offsetps <- offsetdf
coordinates(offsetps) <- ~x+y
points(offsetps$x, offsetps$y)
# Determine polygons points are in
pts_sp <- over(offsetps, polygonsp)
pts_sp <- data.frame(
x = offsetps$x, y = offsetps$y,
enum = unique(localpolydf$enum)[pts_sp])
proc.time() - ptm # end timer
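Here is a sketch of the returnList = TRUE variant mentioned above (my addition, not timed):
# one list element per point, each holding the indices of all polygons it falls in
pts_list <- over(offsetps, polygonsp, returnList = TRUE)
coords <- coordinates(offsetps)
pts_multi <- data.frame(
  x = rep(coords[, 1], lengths(pts_list)),
  y = rep(coords[, 2], lengths(pts_list)),
  enum = unique(localpolydf$enum)[unlist(pts_list)])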
I have a large spatial dataset (12M rows). The geometries are points on a map. For each row in the dataset, I'd like to find all the points that are within 500 meters of that point.
In R, using sf, I've been trying to do this by looping through each row in parallel and running st_buffer and st_intersects, then saving the result as a list in a key-value format (the key being the origin point, the values being the neighbors).
The issue is that the dataset is too large. Even when parallelizing to upwards of 60 cores, the operation takes too long (>1 week and usually crashes).
What are the alternatives to this brute-force approach? Is it possible to build indexes using sf? Perhaps push the operation to an external database?
Reprex:
library(sf)
library(tidyverse)
library(parallel)
library(foreach)
# example data, convert to decimal:
nc <- st_read(system.file("shape/nc.shp", package="sf")) %>% st_transform(32618)
# expand the data a bit to make the example more interesting:
nc <- rbind(nc,nc,nc)
nc <- nc %>% mutate(Id = row_number())
## can run in parallel if desired:
# num_cores <- parallel::detectCores()-2
# cl <- makeSOCKcluster(num_cores)
# registerDoSNOW(cl)
# or just run in sequence:
registerDoSEQ()
neighbors <- foreach(ii = 1:nrow(nc),
                     .verbose = FALSE,
                     .errorhandling = "pass") %dopar% {
  l = 500 # 500 meters
  # isolate the row as the origin point:
  row_interest <- filter(nc, row_number() == ii)
  # create the buffer:
  buffer <- row_interest %>% st_buffer(dist = l)
  # extract the row numbers of the neighbors
  comps_idx <- suppressMessages(st_intersects(buffer, nc))[[1]]
  # get all the neighbors:
  comps <- nc %>% filter(row_number() %in% comps_idx)
  # remove the geometry:
  comps <- comps %>% st_set_geometry(NULL)
  # flow control in case there are no neighbors:
  if (nrow(comps) > 0) {
    comps$Origin_Key <- row_interest$Id
  } else {
    comps <- data_frame("lat" = NA_integer_, "lon" = NA_integer_, "bbl" = row_interest$bbl)
    comps$Origin_Key <- row_interest$Id
  }
  return(comps)
}
closeAllConnections()
length(neighbors)==nrow(nc)
[1] TRUE
When working with sf objects, explicitly looping over features to perform
binary operations such as intersects is usually counterproductive (see also
How can I speed up spatial operations in `dplyr::mutate()`?)
An approach similar to yours (i.e., buffering and intersecting), but without
the explicit for loop works better.
Let's see how it performs on a reasonably big dataset of 50000 points:
library(sf)
library(spdep)
pts <- data.frame(x = runif(50000, 0, 100000),
y = runif(50000, 0, 100000))
pts <- sf::st_as_sf(pts, coords = c("x", "y"), remove = F)
pts_buf <- sf::st_buffer(pts, 5000)
coords <- sf::st_coordinates(pts)
microbenchmark::microbenchmark(
sf_int = {int <- sf::st_intersects(pts_buf, pts)},
spdep = {x <- spdep::dnearneigh(coords, 0, 5000)}
, times = 1)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> sf_int 21.56186 21.56186 21.56186 21.56186 21.56186 21.56186 1
#> spdep 108.89683 108.89683 108.89683 108.89683 108.89683 108.89683 1
You can see here that the st_intersects approach is 5 times faster than
the dnearneigh one.
Unfortunately, this is unlikely to solve your problem. Looking at execution
times for datasets of different sizes we get:
subs <- c(1000, 3000, 5000, 10000, 15000, 30000, 50000)
times <- NULL
for (sub in subs[1:7]) {
  pts_sub <- pts[1:sub, ]
  buf_sub <- pts_buf[1:sub, ]
  t0 <- Sys.time()
  int <- sf::st_intersects(buf_sub, pts_sub)
  times <- cbind(times, as.numeric(difftime(Sys.time(), t0, units = "secs")))
}
plot(subs, times)
times <- as.numeric(times)
reg <- lm(times~subs+I(subs^2))
summary(reg)
#>
#> Call:
#> lm(formula = times ~ subs + I(subs^2))
#>
#> Residuals:
#> 1 2 3 4 5 6 7
#> -0.16680 -0.02686 0.03808 0.21431 0.10824 -0.23193 0.06496
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.429e-01 1.371e-01 1.772 0.151
#> subs -2.388e-05 1.717e-05 -1.391 0.237
#> I(subs^2) 8.986e-09 3.317e-10 27.087 1.1e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1908 on 4 degrees of freedom
#> Multiple R-squared: 0.9996, Adjusted R-squared: 0.9994
#> F-statistic: 5110 on 2 and 4 DF, p-value: 1.531e-07
Here, we see an almost perfect quadratic relationship between time and
number of points (as would be expected). On a 10M points subset, assuming
that the behaviour does not change, you would get:
predict(reg, newdata = data.frame(subs = 10E6))
#> 1
#> 898355.4
That prediction corresponds to about 10 days, assuming that the trend stays constant when further increasing the number of points (but the same would happen for dnearneigh...).
My suggestion would be to "split" your points in chunks and then work on a
per-split basis.
You could for example order your points at the beginning along
the x-axis and then easily and quickly extract subsets of buffers and of points with which to compare them using data.table.
Clearly, the subset of "points" needs to cover a larger extent than that of the "buffers", extended by the comparison distance. So, for example, if you make a subset of pts_buf with centroids in [50000, 55000], the corresponding subset of pts should include points in the range [49500, 55500].
This approach is easily parallelizable by assigning the different subsets to
different cores in a foreach or similar construct.
I do not even know if using spatial objects/operations is beneficial here, since once we have the coordinates all that is needed is computing and subsetting Euclidean distances: I suspect that a carefully coded brute-force data.table-based approach could also be a feasible solution.
HTH!
UPDATE
In the end, I decided to give it a go and see how much speed we could gain from this kind of approach. Here is a possible implementation:
points_in_distance_parallel <- function(in_pts,
                                        maxdist,
                                        ncuts = 10) {
  require(doParallel)
  require(foreach)
  require(data.table)
  require(sf)
  # convert points to data.table and create a unique identifier
  pts <- data.table(in_pts)
  pts <- pts[, or_id := 1:dim(in_pts)[1]]
  # divide the extent in ncuts*ncuts quadrants and assign each
  # point to a quadrant, then create the index over "xcut"
  range_x  <- range(pts$x)
  limits_x <- (range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts)
  range_y  <- range(pts$y)
  limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
  pts[, `:=`(xcut = as.integer(cut(x, ncuts, labels = 1:ncuts)),
             ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))] %>%
    setkey(xcut, ycut)
  results <- list()
  cl <- parallel::makeCluster(parallel::detectCores() - 2,
                              type = ifelse(.Platform$OS.type != "windows",
                                            "FORK", "PSOCK"))
  doParallel::registerDoParallel(cl)
  # start cycling over quadrants
  out <- foreach(cutx = seq_len(ncuts), .packages = c("sf", "data.table")) %dopar% {
    count <- 0
    # get the points included in a x-slice extended by `maxdist`, and build
    # an index over y
    min_x_comp <- ifelse(cutx == 1, limits_x[cutx], (limits_x[cutx] - maxdist))
    max_x_comp <- ifelse(cutx == ncuts,
                         limits_x[cutx + 1],
                         (limits_x[cutx + 1] + maxdist))
    subpts_x <- pts[x >= min_x_comp & x < max_x_comp] %>%
      setkey(y)
    for (cuty in seq_len(ncuts)) {
      count <- count + 1
      # subset over subpts_x to find the final set of points needed for the
      # comparisons
      min_y_comp <- ifelse(cuty == 1,
                           limits_y[cuty],
                           (limits_y[cuty] - maxdist))
      max_y_comp <- ifelse(cuty == ncuts,
                           limits_y[cuty + 1],
                           (limits_y[cuty + 1] + maxdist))
      subpts_comp <- subpts_x[y >= min_y_comp & y < max_y_comp]
      # subset over subpts_comp to get the points included in a x/y chunk,
      # whose "neighbours" we want to find. Then buffer them.
      subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
        sf::st_as_sf() %>%
        st_buffer(maxdist)
      # retransform to sf since data.tables lost the geometric attributes
      subpts_comp <- sf::st_as_sf(subpts_comp)
      # compute the intersection and save results in an element of "results".
      # For each point, save its "or_id" and the "or_ids" of the points within "maxdist"
      inters <- sf::st_intersects(subpts_buf, subpts_comp)
      # save results
      results[[count]] <- data.table(
        id = subpts_buf$or_id,
        int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))
    }
    return(data.table::rbindlist(results))
  }
  parallel::stopCluster(cl)
  data.table::rbindlist(out)
}
The function takes as input a points sf object, a target distance and a number
of "cuts" to use to divide the extent in quadrants, and provides in output
a data frame in which, for each original point, the "ids" of the points within
maxdist are reported in the int_ids list column.
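For example, on the pts object built earlier in this answer, a call would look something like this (a usage sketch; timings will of course depend on your machine):
# ids of all points within 5000 m of each point, using a 10 x 10 grid of quadrants
neigh_dt <- points_in_distance_parallel(pts, maxdist = 5000, ncuts = 10)
head(neigh_dt)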
On a test dataset with a varying number of uniformly distributed points and two values of maxdist, I got these kinds of results (the "parallel" run uses 6 cores):
So, here we get a 5-6X speed improvement already on the "serial" implementation, and another 5X thanks to parallelization over 6 cores.
Although the timings shown here are merely indicative, and related to the particular test dataset we built (on a less uniformly distributed dataset I would expect a lower speed improvement), I think this is quite good.
HTH!
PS: a more thorough analysis can be found here:
https://lbusettspatialr.blogspot.it/2018/02/speeding-up-spatial-analyses-by.html
I have two alternatives, one that seems faster, and one that is not. The faster method may not be amenable to parallelization, unfortunately, so it may not help.
library(sf)
nc <- st_transform(st_read(system.file("shape/nc.shp", package="sf")), 32618)
# create points
pts <- st_centroid(nc)
dis <- 50000
result <- list()
Your approach
system.time(
for (i in 1:nrow(pts)) {
b <- st_buffer(pts[i,], dist = dis)
result[[i]] <- st_intersects(b, nc)[[1]]
}
)
Slower alternative
system.time(
for (i in 1:nrow(pts)) {
b <- as.vector(st_distance(pts[i,], pts))
result[[i]] <- which(b <= dis)
}
)
For smaller datasets, without looping:
x <- st_distance(pts)
res <- apply(x, 1, function(i) which(i < dis))
Faster alternative (not obvious how to do in parallel), and perhaps an unfair comparison as we do not do the looping ourselves
library(spdep)
pts2 <- st_coordinates(pts)
system.time(x <- dnearneigh(pts2, 0, dis))
I would first get a list with the indices that indicate the neighbors, and extract attributes after that (that should be fast).
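A sketch of that attribute-extraction step (my addition): x is a list of neighbour indices, with 0 marking points that have no neighbour within dis.
nc_attr <- sf::st_drop_geometry(nc)   # attribute table without the geometry column
neigh_attr <- lapply(x, function(idx) nc_attr[idx[idx > 0], ])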
Working off of RobertH's answer, it is a bit faster to extract coordinates using sf::st_coordinates in this particular example.
library(sf)
library(spdep)
nc <- st_transform(st_read(system.file("shape/nc.shp", package="sf")), 32618)
# create points
pts <- st_centroid(nc)
dis <- 50000
# quickest solution:
x <- spdep::dnearneigh(sf::st_coordinates(pts), 0, dis)
microbenchmarking:
my_method <- function(pts) {
  result <- list()
  for (i in 1:nrow(pts)) {
    b <- st_buffer(pts[i,], dist = dis)
    result[[i]] <- st_intersects(b, nc)[[1]]
  }
  result
}
library(microbenchmark)
microbenchmark(
my_method(pts),
dnearneigh(as(pts, 'Spatial'), 0, dis),
dnearneigh(st_coordinates(pts), 0, dis)
)
Unit: microseconds
expr min lq mean median uq max neval
my_method(pts) 422807.146 427434.3450 435974.4320 429862.8705 434968.3975 596832.271 100
dnearneigh(as(pts, "Spatial"), 0, dis) 3727.221 3939.8540 4155.3094 4112.8200 4221.9525 7592.739 100
dnearneigh(st_coordinates(pts), 0, dis) 394.323 409.5275 447.1614 430.4285 484.0335 611.970 100
checking equivalence:
x <- dnearneigh(as(pts, 'Spatial'), 0, dis)
y <- dnearneigh(st_coordinates(pts), 0, dis)
all.equal(x,y, check.attributes = F)
[1] TRUE
I have a gridded climate dataset, such as:
# generate time vector
time1 <- seq(14847.5,14974.5, by = 1)
time2 <- seq(14947.5,14974.5, by = 1)
time <- c(time1,time2)
time <- as.POSIXct(time*86400,origin='1970-01-01 00:00')
# generate lat and lon coordinates
lat <- seq(80,90, by = 1)
lon <- seq(20,30, by = 1)
# generate 3dimensional array
dat <- array(runif(length(lat)*length(lon)*length(time)),
dim = c(length(lon),length(lat),length(time)))
such that
> dim(dat)
[1] 11 11 156
The dimensions of the data describe the variable at different longitudes (dim 1), latitudes (dim 2), and times (dim 3).
The issue I have at the moment is that some of the times are repeated, something to do with overlapping sensors measuring the data. Therefore, I was wondering if it is possible to keep only the unique times for dat, but average the data within the grid for the duplicated times, i.e. if a day is repeated, take the average value in each latitude/longitude cell for that time.
I can find the unique times as:
# only select unique times
new_time <- unique(time)
unique_time <- unique(time)
The following code then aims to loop through each grid (lat/lon) and average all of the duplicated days.
# loop through lat/lon coordinates to generate new data
new_dat <- array(dim = c(length(lon), length(lat), length(new_time)))
for (i in 1:length(lon)) {
  for (ii in 1:length(lat)) {
    dat2 <- dat[i, ii, ]
    dat2b <- NA
    for (k in 1:length(unique_time)) {
      idx <- time == unique_time[k]
      dat2b[k] <- mean(dat2[idx], na.rm = TRUE)
    }
    new_dat[i, ii, ] <- dat2b
  }
}
I'm convinced that this provides the correct answer, but I'm certain there is a much cleaner method to achieve this.
I should also note that my data is quite large (i.e. k = 7000), so this last loop is not very efficient, to say the least.
My original answer:
This is a bit more concise and efficient by use of aggregate:
for (i in 1:length(lon)) {
  for (ii in 1:length(lat)) {
    new_dat[i, ii, ] <- as.numeric(aggregate(dat[i, ii, ], by = list(time), mean)$x)
  }
}
It still has 2 of the 3 loops, but it manages to bypass creating dat2, dat2b, and unique_time.
My improved answer:
f <- function(i, ii) { as.numeric(aggregate(dat[i, ii, ], by = list(time), mean)$x) }
for (i in 1:nrow(expand.grid(1:length(lon), 1:length(lat)))) {
  new_dat[expand.grid(1:length(lon), 1:length(lat))[i, 1],
          expand.grid(1:length(lon), 1:length(lat))[i, 2], ] <-
    f(expand.grid(1:length(lon), 1:length(lat))[i, 1],
      expand.grid(1:length(lon), 1:length(lat))[i, 2])
}
Got it down to just 1 loop. We could probably bypass that loop too with an apply.
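For what it's worth, a sketch of that apply-based version (note that tapply returns the averages ordered by sorted time, which here coincides with the order of unique(time)):
# average over duplicated time stamps for every lon/lat cell in one go
avg <- apply(dat, c(1, 2), function(v) tapply(v, time, mean, na.rm = TRUE))
# apply puts the collapsed time dimension first, so permute back to lon x lat x time
new_dat <- aperm(avg, c(2, 3, 1))
dim(new_dat)   # 11 11 128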
After performing a cluster analysis to my dataset (a dataframe named data.matrix), I added a new column, named cluster, at the end (col 27) containing the cluster name that each instance belongs to.
What I want now is a representative instance from each cluster. I tried to find the instance having the smallest Euclidean distance from the cluster's centroid (and to repeat the procedure for each one of my clusters).
This is what I did. Can you think of other -perhaps more elegant- ways? (assume numeric columns with no nulls).
clusters <- levels(data.matrix$cluster)
cluster_col = c(27)
for (j in 1:length(clusters)) {
  # get the subset for cluster j
  data = data.matrix[data.matrix$cluster == clusters[j], ]
  # remove the cluster column
  data <- data[, -cluster_col]
  # calculate the centroid
  cent <- mean(data)
  # copy data to data.matrix_cl, attaching a distance column at the end
  data.matrix_cl <- cbind(data, dist = apply(data, 1, function(x) {sqrt(sum((x - cent)^2))}))
  # get instances with min distance
  candidates <- data.matrix_cl[data.matrix_cl$dist == min(data.matrix_cl$dist), ]
  # print their rownames
  print(paste("Candidates for cluster ", j))
  print(rownames(candidates))
}
First, I don't know if your distance formula is right. I think it should be either sqrt(sum((x - cent)^2)) or sum(abs(x - cent)); I assumed the first.
Second, just printing the solution is not a good idea, so I first compute, then print.
Third, I recommend using plyr, but I give both solutions (with and without plyr).
# Simulated data:
n <- 100
data.matrix <- cbind(
  data.frame(matrix(runif(26*n), n, 26)),
  cluster = sample(letters[1:6], n, replace = TRUE)
)
cluster_col <- which(names(data.matrix)=="cluster")
# With plyr:
require(plyr)
candidates <- dlply(data.matrix, "cluster", function(data) {
  dists <- colSums(laply(data[, -cluster_col], function(x) (x - mean(x))^2))
  rownames(data)[dists == min(dists)]
})
l_ply(names(candidates), function(c_name, c_list = candidates[[c_name]]) {
  print(paste("Candidates for cluster ", c_name))
  print(c_list)
})
# without plyr
candidates <- tapply(
  1:nrow(data.matrix),
  data.matrix$cluster,
  function(id, data = data.matrix[id, ]) {
    dists <- rowSums(sapply(data[, -cluster_col], function(x) (x - mean(x))^2))
    rownames(data)[dists == min(dists)]
  }
)
invisible(lapply(names(candidates), function(c_name, c_list = candidates[[c_name]]) {
  print(paste("Candidates for cluster ", c_name))
  print(c_list)
}))
Is the technique you're interested in 'k-means clustering'? If so, here's how the centroids are calculated at each iteration:
1. Choose a k value (an integer that specifies the number of clusters to divide your data set into);
2. randomly select k rows from your data set; those are the centroids for the 1st iteration;
3. calculate the distance of each data point from each centroid;
4. each data point has a 'closest centroid', which determines its 'group';
5. calculate the mean of each group; those are the new centroids;
6. go back to step 3 (the stopping criterion is usually based on comparing the centroid values in successive loops, i.e., if the values change by no more than 0.01%, then quit).
Those steps in code:
# toy data set
mx = matrix(runif(60, 10, 99), nrow=12, ncol=5, byrow=F)
cndx = sample(nrow(mx), 2)
# the two centroids at iteration 1
cn1 = mx[cndx[1],]
cn2 = mx[cndx[2],]
# to calculate the Euclidean distance from each centroid (using only the first two columns)
fnx1 = function(a){sqrt((cn1[1] - a[1])^2 + (cn1[2] - a[2])^2)}
fnx2 = function(a){sqrt((cn2[1] - a[1])^2 + (cn2[2] - a[2])^2)}
# calculate distance matrix
dx1 = apply(mx, 1, fnx1)
dx2 = apply(mx, 1, fnx2)
dx = cbind(dx1, dx2)   # one row per observation, one column per centroid
# index for extracting the new groups from the data set
ndx = apply(dx, 1, which.min)
group1 = mx[ndx==1,]
group2 = mx[ndx==2,]
# calculate the new centroids for the next iteration
new_cnt1 = apply(group1, 2, mean)
new_cnt2 = apply(group2, 2, mean)
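For reference, base R's stats::kmeans() runs this whole iteration for you (a minimal usage sketch, separate from the hand-rolled steps above):
km <- kmeans(mx, centers = 2)
km$centers   # final centroids, one row per cluster
km$cluster   # cluster membership for each row of mx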