Speed up raster::extract with weights in R

I want to extract the precise (weighted) mean of raster values within an area defined by a polygon in R. This works with raster::extract and the option weights=TRUE. However, the operation becomes prohibitively slow with large rasters, and the function does not appear to be parallelized, so beginCluster() ... endCluster() does not speed it up.
I need to extract the values for a range of rasters, exemplified here as r, r10 and r100. Is there a way to speed this up in R, or is there an alternative way of doing this with GDAL?
library(raster)   # also attaches sp, which provides Polygon/SpatialPolygons

r <- raster(nrow = 1000, ncol = 1000,
            vals = sample(seq(0, 0.8, 0.01), 1000000, replace = TRUE))
r10  <- aggregate(r, fact = 10)
r100 <- aggregate(r, fact = 100)

v <- Polygons(list(Polygon(cbind(c(-100, 100, 80, -120), c(-70, 0, 70, 0)))), ID = "a")
v <- SpatialPolygons(list(v))

plot(r)
plot(r10)
plot(r100)
plot(v, add = TRUE)

system.time({
  precise.mean <- raster::extract(r100, v, method = "simple",
                                  weights = TRUE, normalizeWeights = TRUE, fun = mean)
})
#    user  system elapsed
#   0.251   0.000   0.253

precise.mean
#           [,1]
# [1,] 0.3994278
system.time({
  precise.mean <- raster::extract(r10, v, method = "simple",
                                  weights = TRUE, normalizeWeights = TRUE, fun = mean)
})
#    user  system elapsed
#   7.447   0.000   7.446

precise.mean
#           [,1]
# [1,] 0.3995429

In the end I solved the problem with gdalUtils, working directly on the hard disk.
I used gdalwarp() to reduce the raster resolution to that of r10 and r100.
Then gdalwarp() again to bring the resulting raster back up to the original resolution of r.
Then gdalwarp() with cutline = "v.shp" and crop_to_cutline = TRUE to mask the raster to the vector v.
And finally gdalinfo() with statistics enabled, combined with subsetting on grep("Mean=", x), to extract the mean values (see the sketch below).
All of this was wrapped in a foreach() %dopar% loop to process a number of rasters and resolutions.
While convoluted and probably not as precise as raster::extract, it did the job.
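For reference, a rough sketch of that workflow with gdalUtils (file names are hypothetical and the arguments mirror the GDAL command-line flags; the resolutions match the example raster r above):

library(gdalUtils)

xres <- 0.36; yres <- 0.18                 # native resolution of r in this example

# 1. Warp to a coarser resolution (factor 10 here), then back to the original
#    grid so the coarse values cover the fine cells.
gdalwarp("r.tif", "r10.tif", tr = c(10 * xres, 10 * yres), r = "average")
gdalwarp("r10.tif", "r10_fine.tif", tr = c(xres, yres), r = "near")

# 2. Mask to the polygon stored as a shapefile.
gdalwarp("r10_fine.tif", "r10_masked.tif",
         cutline = "v.shp", crop_to_cutline = TRUE)

# 3. Read the band statistics and pull out the mean.
x <- gdalinfo("r10_masked.tif", stats = TRUE)
x[grep("Mean=", x)]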

It should actually run faster if you first call beginCluster (the function then deals with the parallelization). Even better would be to use version 2.7-14 which has a much faster implementation. It is currently under review at CRAN, but you can also get it here: https://github.com/rspatial/raster
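For reference, a minimal sketch of the pattern this answer refers to (the number of cores is arbitrary):

library(raster)

beginCluster(4)   # start a local cluster that raster functions can use
precise.mean <- raster::extract(r, v, method = "simple",
                                weights = TRUE, normalizeWeights = TRUE, fun = mean)
endCluster()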

Related

How to calculate aggregate fact parameter in terra package?

The terra package has an aggregate function that creates a new SpatRaster with a lower resolution (larger cells), but it needs the fact parameter.
When converting many rasters, fact has to be calculated each time. Is there a way to derive fact from the target resolution of another raster? Other functions take an existing raster as input, e.g. function(r1, r2).
library(terra)
r1 <- rast(ncol=10,nrow=10)
r2 <- rast(ncol=4,nrow=4)
values(r1) <- runif(ncell(r1))
values(r2) <- runif(ncell(r2))
I have tried
r3 <- aggregate(r1, fact = res(r1) / res(r2))
# Error: [aggregate] values in argument 'fact' should be > 0
Found the answer: I had res(r1)/res(r2) inverted, it should be
r3 <- aggregate(r1, fact = res(r2) / res(r1))
Still, it would be much nicer to simply pass the target raster itself.
You cannot aggregate r1 to r2 exactly, because res(r2)/res(r1) does not return whole numbers.
res(r2) / res(r1)
#[1] 2.5 2.5
More generally, you cannot assume that one raster can be aggregated to another, so taking another raster as the second argument is not as obvious as it is for methods such as resample.
In this special case you can do
aggregate(disagg(r1, 2), 5)
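If you want to pass a target raster directly, a small wrapper along these lines can derive fact for you (aggregate_to is a hypothetical helper, not part of terra):

aggregate_to <- function(x, target) {
  fact <- res(target) / res(x)
  if (any(fact < 1) || any(abs(fact - round(fact)) > 1e-9)) {
    stop("res(target) is not a whole multiple of res(x); consider resample() instead")
  }
  aggregate(x, fact = round(fact))
}

r4 <- rast(ncol = 2, nrow = 2)   # res(r4)/res(r1) is exactly c(5, 5)
r5 <- aggregate_to(r1, r4)       # works
# aggregate_to(r1, r2) would stop, since res(r2)/res(r1) is c(2.5, 2.5)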

How to use doParallel for calculating distance between zipcodes in R?

I have a large dataset (2.6M rows) with two zip codes and the corresponding latitudes and longitudes, and I am trying to compute the distance between them. I am primarily using the package geosphere to calculate Vincenty Ellipsoid distance between the zip codes but it is taking a massive amount of time for my dataset. What can be a fast way to implement this?
What I tried
library(tidyverse)
library(geosphere)

zipdata <- select(fulldata, originlat, originlong, destlat, destlong)

## Very basic approach
for (i in seq_len(nrow(zipdata))) {
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i], zipdata$originlong[i]),
                            c(zipdata$destlat[i], zipdata$destlong[i]),
                            fun = distVincentyEllipsoid)
}

## Tidyverse approach
zipdata <- zipdata %>%
  mutate(dist2 = distm(cbind(originlat, originlong), cbind(destlat, destlong),
                       fun = distHaversine))
Both of these methods are extremely slow. I understand that 2.1M rows will never be a "fast" calculation, but I think it can be made faster. I have tried the following approach on a smaller test data without any luck,
library(doParallel)

cores <- 15
cl <- makeCluster(cores)
registerDoParallel(cl)

test <- select(head(fulldata, n = 1000), originlat, originlong, destlat, destlong)

foreach(i = seq_len(nrow(test))) %dopar% {
  library(geosphere)
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i], zipdata$originlong[i]),
                            c(zipdata$destlat[i], zipdata$destlong[i]),
                            fun = distVincentyEllipsoid)
}
stopCluster(cl)
Can anyone help me out with either the correct way to use doParallel with geosphere or a better way to handle this?
Edit: Benchmarks from (some) replies
## benchmark
library(microbenchmark)
library(geodist)

zipsamp <- sample_n(zip, size = 1000000)

microbenchmark(
  dave = {
    # Dave2e
    zipsamp$dist1 <- distHaversine(cbind(zipsamp$patlong, zipsamp$patlat),
                                   cbind(zipsamp$faclong, zipsamp$faclat))
  },
  geohav = {
    zipsamp$dist2 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "haversine")
  },
  geovin = {
    zipsamp$dist3 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "vincenty")
  },
  geocheap = {
    zipsamp$dist4 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "cheap")
  },
  unit = "s", times = 100)
# Unit: seconds
#      expr        min         lq       mean     median         uq        max neval cld
#      dave 0.28289613 0.32010753 0.36724810 0.32407858 0.32991396 2.52930556   100   d
#    geohav 0.15820531 0.17053853 0.18271300 0.17307864 0.17531687 1.14478521   100   b
#    geovin 0.23401878 0.24261274 0.26612401 0.24572869 0.24800670 1.26936889   100   c
#  geocheap 0.01910599 0.03094614 0.03142404 0.03126502 0.03203542 0.03607961   100   a
A simple all.equal test showed that for my dataset the haversine method is equal to the vincenty method, but has a "Mean relative difference: 0.01002573" with the "cheap" method from the geodist package.
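For reference, a sketch of that check (column names taken from the benchmark above):

all.equal(zipsamp$dist2, zipsamp$dist3)   # geodist haversine vs. vincenty
all.equal(zipsamp$dist2, zipsamp$dist4)   # geodist haversine vs. "cheap" ruler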
R is a vectorized language, so these functions operate over all elements of their vector arguments at once. Since you are calculating the distance between the origin and destination of each row, the loop is unnecessary; the vectorized approach is roughly 1000x faster than the loop.
Also, calling distVincentyEllipsoid (or distHaversine, etc.) directly and bypassing the distm wrapper should improve performance further.
Without any sample data this snippet is untested:
library(geosphere)

zipdata <- select(fulldata, originlat, originlong, destlat, destlong)

## Vectorized approach (note the cbind: each argument is a two-column lon/lat matrix)
zipdata$dist1 <- distVincentyEllipsoid(cbind(zipdata$originlong, zipdata$originlat),
                                       cbind(zipdata$destlong, zipdata$destlat))
Note: for most geosphere functions to work correctly, coordinates must be given longitude first, then latitude.
The reason the tidyverse approach listed above is slow is that distm calculates the distance between every origin and every destination, which would produce a 2-million-by-2-million matrix.
I used @SymbolixAU's suggestion to use the geodist package to perform the 2.1M distance calculations on my datasets. I found it to be significantly faster than the geosphere package in every test (I have added one of them in my main question). The measure = "cheap" option in geodist uses the cheap ruler method, which has low error rates for distances below 100 km; see the geodist vignette for more information. Since some of my distances were longer than 100 km, I settled on the Vincenty ellipsoid measure.
If you are going to use geosphere, I would either use a fast approximate method like distHaversine, or the still fast and very precise distGeo method (the distVincenty* functions are mainly there for completeness).
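For example, a minimal, untested sketch of a vectorized distGeo call (column names as in the question; longitude first):

library(geosphere)

zipdata$dist_geo <- distGeo(cbind(zipdata$originlong, zipdata$originlat),
                            cbind(zipdata$destlong, zipdata$destlat))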

silhouette calculation in R for large data

I want to calculate silhouette for cluster evaluation. There are some packages in R, for example cluster and clValid. Here is my code using cluster package:
# load the data
# a data from the UCI website with 434874 obs. and 3 variables
data <- read.csv("./data/spatial_network.txt",sep="\t",header = F)
# apply kmeans
km_res <- kmeans(data, 20, iter.max = 1000,
                 nstart = 20, algorithm = "MacQueen")
# calculate silhouette
library(cluster)
sil <- silhouette(km_res$cluster, dist(data))
# plot silhouette
library(factoextra)
fviz_silhouette(sil)
The code works well for smaller data, say 50,000 observations, but for larger data I get an error like "Error: cannot allocate vector of size 704.5 Gb". This is likely also a problem for the Dunn index and other internal indices on large datasets.
I have 32 GB of RAM in my computer. The problem comes from computing dist(data). I am wondering whether it is possible to avoid computing dist(data) in advance and instead calculate the required distances on the fly within the silhouette formula.
I would appreciate any help with this problem and with how to calculate the silhouette for large and very large datasets.
You can implement the silhouette yourself.
It only needs every distance twice, so storing an entire distance matrix is not necessary. It may run a bit slower because distances are computed twice, but the much better memory efficiency may well make up for that.
It will still take a LONG time, though.
You should consider using only a subsample (do you really need to consider all points?), as well as alternatives such as the Simplified Silhouette, in particular with k-means. You gain very little from extra data with such methods, so you may as well use a subsample; see the sketch below.
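A minimal sketch of the subsampling idea (object names as in the question; the subsample size is arbitrary):

library(cluster)

idx <- sample(nrow(data), 10000)                            # random subsample of rows
sil_sub <- silhouette(km_res$cluster[idx], dist(data[idx, ]))
mean(sil_sub[, "sil_width"])                                # average silhouette width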
Anony-Mousse's answer is spot on, particularly the point about subsampling. This is very important for very large datasets because of the growth in computational cost.
Here is another option for calculating internal measures such as the silhouette and the Dunn index: the clusterCrit package. clusterCrit computes clustering validation indices and does not require the entire distance matrix in advance. However, it might be slow, as Anony-Mousse discussed. See the documentation:
https://www.rdocumentation.org/packages/clusterCrit/versions/1.2.8/topics/intCriteria
clusterCrit also calculates most of the internal measures for cluster validation.
Example:
# intCriteria() expects a numeric matrix, hence as.matrix()
intCriteria(as.matrix(data), km_res$cluster, c("Silhouette", "Calinski_Harabasz", "Dunn"))
If you want to compute the silhouette index without building the full distance matrix, you can alternatively use the clues package, which improves on both the time and the memory used by the cluster package. Here is an example:
library(rbenchmark)
library(cluster)
library(clues)

set.seed(123)
x <- c(rnorm(1000, 0, 0.9), rnorm(1000, 4, 1), rnorm(1000, -5, 1))
y <- c(rnorm(1000, 0, 0.9), rnorm(1000, 6, 1), rnorm(1000,  5, 1))
cluster <- rep(as.factor(1:3), each = 1000)
df <- cbind(x, y)

head(df)
#               x           y
# [1,] -0.50442808 -0.13527673
# [2,] -0.20715974 -0.29498142
# [3,]  1.40283748 -1.30334876
# [4,]  0.06345755 -0.62755613
# [5,]  0.11635896  2.33864121
# [6,]  1.54355849 -0.03367351
Runtime comparison between the two functions:
benchmark(f1 = silhouette(as.integer(cluster), dist = dist(df)),
          f2 = get_Silhouette(y = df, mem = cluster))
#   test replications elapsed relative user.self sys.self user.child sys.child
# 1   f1          100   15.16    1.902     13.00     1.64         NA        NA
# 2   f2          100    7.97    1.000      7.76     0.00         NA        NA

Memory usage comparison between the two functions:
library(pryr)
object_size(silhouette(as.integer(cluster), dist = dist(df)))
# 73.9 kB
object_size(get_Silhouette(y = df, mem = cluster))
# 36.6 kB
In conclusion, clues::get_Silhouette reduces both the time and the memory needed to compute the same silhouette.

R: Faster way of computing large distance matrix

I am computing a distance matrix between a large number of locations (5,000) on a sphere, using the Haversine distance function.
Here is my code:
require(geosphere)
x=rnorm(5000)
y=rnorm(5000)
xy1=cbind(x,y)
The time taken for computing the distance matrix is
system.time(
  outer(1:nrow(xy1), 1:nrow(xy1),
        function(i, j) distHaversine(xy1[i, 1:2], xy1[j, 1:2]))
)
This takes a long time to execute. Any suggestions on how to speed it up? Thanks.
Try the built-in function in the geosphere package?
z <- distm( xy1 )
The default distance function for distm() - which calculates a distance matrix between a set of points - is the Haversine ("distHaversine") formula, but you may specify another using the fun argument.
On my 2.6GHz Core i7 rMBP this takes about 5 seconds for 5,000 points.
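For instance (a sketch, not part of the original answer), a geodesic distance on the WGS84 ellipsoid can be requested through fun:

z_geo <- distm(xy1, fun = distGeo)   # instead of the default distHaversine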
I add below a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp), and are therefore very fast.
library(geosphere)
library(spatialrisk)
library(data.table)

x <- rnorm(5000)
y <- rnorm(5000)
xy1 <- data.table(x, y)

# Cross join two data tables
coordinates_dt <- optiRum::CJ.dt(xy1, xy1)

system.time({
  z <- distm(xy1)
})
#    user  system elapsed
#  14.163   3.700  19.072

system.time({
  distances_m <- coordinates_dt[, dist_m := spatialrisk::haversine(y, x, i.y, i.x)]
})
#   user  system elapsed
#  2.027   0.848   2.913

applying the pvclust R function to a precomputed dist object

I'm using R to perform hierarchical clustering. As a first approach I used hclust and performed the following steps:
I imported the distance matrix
I used the as.dist function to transform it into a dist object
I ran hclust on the dist object
Here's the R code:
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, "ward")
At this point I would like to do something similar with pvclust; however, I cannot, because it is not possible to pass a precomputed dist object. How can I proceed, given that I am using a distance that is not among those provided by R's dist function?
I've tested Vincent's suggestion; you can do the following (my data set is a dissimilarity matrix):
# Import your data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)

# Compute the eigenvalues
x <- cmdscale(d, 1, eig = TRUE)

# Plot the eigenvalues and choose the correct number of dimensions
# (the point where the eigenvalues get close to 0)
plot(x$eig,
     type = "h", lwd = 5, las = 1,
     xlab = "Number of dimensions",
     ylab = "Eigenvalues")

# Recover coordinates that reproduce the distance matrix,
# using the number of dimensions chosen above (nb_dimensions)
x <- cmdscale(d, nb_dimensions)

# As mentioned by Stéphane, pvclust() clusters columns
pvclust(t(x))
If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.
# Sample distance matrix
n <- 100
k <- 1000
d <- dist(matrix(rnorm(k * n), nc = k), method = "manhattan")

# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n - 1)
stopifnot(sum(abs(dist(x) - d)) < 1e-6)

# You can then use x or d interchangeably
r1 <- hclust(d)
r2 <- hclust(dist(x))   # identical to r1

library(pvclust)
r3 <- pvclust(x)
If the dataset is large, you may have to check how pvclust is implemented.
It's not clear to me whether you only have the distance matrix or you also have the data from which you computed it. In the former case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or the like; I provided some hints on a related question on CrossValidated). In the latter case, the authors of pvclust provide an example of how to use a custom distance function, although that means you will have to install their "unofficial version".
