How slow is too slow when kriging with gstat in R

I am trying to use the krige function in the gstat package of R to interpolate some spatial ocean depth data. I am finding that for more than about 1,000 points, the function starts taking unreasonable amounts of time to finish (i.e., hours to days, or it never finishes at all). Is this normal or am I doing something wrong? I am particularly concerned because my eventual goal is to do spatio-temporal kriging of a very large dataset (>30,000 data points) and I am worried that it just won't be feasible given these run times.
I am running gstat-1.1-3 and R-3.3.2. Below is the code I am running:
library(sp); library(raster); library(gstat)
v.utm # SpatialPointsDataFrame with >30,000 points
# Remove points with identical positions
zd = zerodist(v.utm)
nzd = v.utm[-zd[,1],] # Layer with no identical positions
# Make a raster layer covering point layer
resolution=1e4
e = extent(as.matrix(v.utm@coords)) + resolution
r = raster(e,resolution=resolution)
proj4string(r) = proj4string(v.utm)
# r is a 181x157 raster
# Fit variogram
depth.fit = fit.variogram(variogram(AVGDEPTH~1, nzd), model=vgm(6000,"Exp",1,5e5,1))
# Krige on random sample of 500 points - works fine
size=500
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)
# Krige on random sample of 5000 points - never seems to end
size=5000
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)

The complexity of the Cholesky decomposition (or similar) is O(n^3), meaning that if you multiply the number of points by 10, the time it takes increases by a factor of 1000. There are two ways out of this problem, at least as far as gstat is concerned:
Install an optimized BLAS (e.g. OpenBLAS or MKL); this does not solve the O(n^3) problem, but may give a speedup of at most a factor of n, with n the number of cores available.
Avoid decomposing the full covariance matrix by choosing local neighbourhoods (arguments maxdist and/or nmax); see the sketch below.
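For instance, the local-neighbourhood route applied to the code above might look like this (a sketch; the nmax/maxdist values are placeholders that need tuning for your data):
# Local kriging: each prediction uses only the nearest 50 observations within 2e5 m,
# so gstat never factors the full n x n covariance matrix
depth.krig = krige(AVGDEPTH~1, ss, as(r,"SpatialPixelsDataFrame"),
    model=depth.fit, nmax=50, maxdist=2e5)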

A much faster alternative to kriging for large datasets is griddify in the marmap package. It took me a while to find this, but it works well. It uses bilinear interpolation and although it is designed for bathymetric maps, it works with any xyz data.
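A rough sketch of that route with the bathymetry data from the question (my own illustration, per my reading of marmap::griddify, which expects a three-column longitude/latitude/z data frame; the UTM points would need reprojecting to geographic coordinates first):
library(marmap); library(raster)
# Assumed: nzd.ll is nzd reprojected to longitude/latitude (e.g. via spTransform)
xyz <- data.frame(lon = coordinates(nzd.ll)[, 1],
                  lat = coordinates(nzd.ll)[, 2],
                  depth = nzd.ll$AVGDEPTH)
# nlon/nlat set the output grid dimensions; the result is a raster of interpolated depths
depth.grid <- griddify(xyz, nlon = 157, nlat = 181)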

Related

How can I reduce the running time when computing a variogram for a large SpatialPointsDataFrame in R

I am a beginner in R and geostatistics. I have a large SpatialPointsDataFrame (307,907 elements, 27.2 MB). When I use this data to build a variogram, it takes a lot of time (R runs for a day without any results). I think this is probably because R normally does single-core processing. Is there a way for R to use multiple cores when building a variogram? Can the foreach function do this?
library(reader)
library(gstat)
library(sp)
coordinates(data_del) <- ~X1+X2
variogram <- variogram(X3~1, data_del, cutoff = 100, width =10)
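I don't know of a drop-in foreach parallelization for gstat's variogram, but one workaround (my suggestion, mirroring the subsampling used in the kriging question above) is to estimate the empirical variogram from a random subset of points, which is often close to the full-data estimate:
# Estimate the variogram on a random subset; a few thousand points is often enough
set.seed(1)
sub <- data_del[sample.int(nrow(data_del), 5000), ]
v_sub <- variogram(X3 ~ 1, sub, cutoff = 100, width = 10)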

Issues using spatial autocorrelation in R at specific lags (in m)

For a few days I have been struggling with a challenging spatial analysis involving spatial autocorrelation in R. Specifically, I am interested in the autocorrelation between points set in a grid of roughly 50 m. My aim is to test the autocorrelation between these points (the locations where I collected the data) and to verify whether the autocorrelation decreases as the distance among them increases (which is expected). My idea is to generate radii of specific sizes around each point (50 m, 100 m, 150 m and so on) and to compute Moran's I autocorrelation index for each. Finally, I would like to use ggplot to display Moran's I at each distance (but this is easy to get once I have the outputs).
My starting data frame contains 4 columns: the ID of the point where the data were collected, the value measured at that point (z), a column with longitude (x), and a column with latitude (y). The data look like this:
#install libraries
library(sp)
library(spdep)
library(splm)
library(ape)
ID<- c(1,2,3,4,5,6)
x<-c(20.99984,20.99889, 20.99806,20.99800,20.99700,20.99732)
y<-c(52.21511,52.21489,52.21464,52.21410,52.21327,52.21278)
z<-c(1.16,0.54,0.89,0.60,1.27,1.45)
data <- data.frame(ID,x,y,z)
I read many things online and found this tutorial
https://mgimond.github.io/Spatial/spatial-autocorrelation-in-r.html#morans-i-as-a-function-of-a-distance-band
which actually shows what I'm interested in. However, it doesn't really work from the very beginning: starting from my coordinates, I think there is a problem and I don't know how to transform them into a proper format for R. This is the error message I get:
data <- data.frame(dataPOL$Long , dataPOL$Lat, dataPOL$Human_presence)
coordinates(data) <- c('x','y')
proj4string(data) <- "+init=epsg:4326"
S.dist <- dnearneigh(coordinates, 0, 50) #radius of 50 meters
Error in dnearneigh(coordinates, 0, 50) : Data non-numeric
I did not receive any answer, but I ended up finding a solution:
I have found that the most used packages to work with spatial autocorrelation in R (in my case, Moran I) are spdep and ape.
I tried both: I couldn't get spdep to work, but ape did. Here is the tutorial I followed for my specific case:
https://stats.idre.ucla.edu/r/faq/how-can-i-calculate-morans-i-in-r/
Before calculating the Moran index, you should generate a distance matrix. I did it with rdist.earth from the fields package.
This function measures the distance between each pair of data points based on their coordinates. It recognizes that the world is not flat and calculates what are known as great-circle distances. I specified the distances in km for my specific case.
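With the small example data frame above, that step might look like this (a sketch; rdist.earth expects a two-column longitude/latitude matrix):
library(fields)
coords <- cbind(data$x, data$y)                  # x = longitude, y = latitude
popdists <- rdist.earth(coords, miles = FALSE)   # great-circle distances in km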
To calculate Moran's I, I ran this:
library(ape)
# Radius of 60 m (remember that the fields package works in km or miles)
pop.dists.1 <- (popdists > 0 & popdists <= 0.06)
Moran.I(mydataframe$myzvariable, pop.dists.1)
This is the output I got at this specific radius:
pop.dists.1 <- (popdists > 0 & popdists <= 0.06) # 60 m
Moran.I(dataPOL$Human_presence, pop.dists.1)
$observed
[1] 0.3841241 # Moran index: between -1 and 1; here points within 60 m are autocorrelated
$expected
[1] -0.009615385
$sd
[1] 0.08767598
$p.value
[1] 7.094019e-06
I repeated the calculation for each distance I am interested in: it works really well, and as the distance increases, the Moran's I index approaches 0 (which is what I expected).
I am going to plot the individual outputs with ggplot, as usual, in order to follow the trend of spatial autocorrelation for my z variable; see the sketch below.
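For instance, the plot could be put together like this (a hypothetical sketch of my own; the radii are just examples):
library(ape); library(ggplot2)
radii_km <- c(0.05, 0.10, 0.15, 0.20)   # 50 m, 100 m, 150 m, 200 m (fields works in km here)
mi <- sapply(radii_km, function(r) {
  w <- (popdists > 0 & popdists <= r)
  Moran.I(dataPOL$Human_presence, w)$observed
})
ggplot(data.frame(radius_m = radii_km * 1000, moran = mi), aes(radius_m, moran)) +
  geom_point() + geom_line() +
  labs(x = "Distance (m)", y = "Moran's I")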
Hope this will help if needed!

DBSCAN Clustering with additional features

Can I apply DBSCAN with other features in addition to location? And if so, how can it be done in R or Spark?
I prepared an R table with 3 columns: latitude, longitude, and score (the feature I want to cluster on in addition to location). When I run DBSCAN with the following R code, I get the plot below, which seems to show the algorithm making separate clusters for each pair of columns: (long, lat), (long, score), (lat, score), ...
my R Code:
library(fpc)  # assuming fpc::dbscan here (its plot method produces the pairs plot described)
df = read.table("/home/ahmedelgamal/Desktop/preparedData")
var = dbscan(df, eps = .013)
plot(x = var, data = df)
and this is the plot I get (image not reproduced here).
You are misinterpreting the plot.
You don't get one result per plot, but all plots show the same clusters, only in different attributes.
But you also have the issue that the R version is (to my knowledge) only fast for Euclidean distance.
In your current code, points are neighbors if (lat[i]-lat[j])^2 + (lon[i]-lon[j])^2 + (score[i]-score[j])^2 <= eps^2. This is bad because: (1) latitude and longitude are not Euclidean, so you should be using the haversine distance instead; (2) your additional attribute has a much larger scale, so you pretty much only cluster points with near-zero score; and (3) your score attribute is skewed.
For this problem you should probably be using Generalized DBSCAN. Points are similar if their haversine distance is less than, e.g., 1 mile (you want to measure geographic distance here, not coordinates, because of distortion) and if their score differs by a factor of at most 1.1 (i.e. compare score[y] / score[x], or work in log space?). Since you want both conditions to hold, the usual Euclidean DBSCAN implementation is not enough; you need a Generalized DBSCAN that allows multiple conditions. Look for an implementation of Generalized DBSCAN instead (I believe there is one in ELKI that you may be able to access from Spark), or implement it yourself. It's not very hard to do.
If quadratic runtime is okay for you, you can probably use any distance-matrix-based DBSCAN, and simply "hack" a binary distance matrix:
Compute haversine distances.
Compute score dissimilarity.
Set distance = 0 if haversine < distance-threshold and score-dissimilarity < score-threshold, otherwise 1.
Run DBSCAN with the precomputed distance matrix and eps = 0.5 (since it is a binary matrix, don't change eps!); see the sketch below.
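A rough R sketch of that recipe (my own illustration, not from the answer; the geosphere and dbscan packages, the column names, and both thresholds are assumptions):
library(geosphere)   # distm() with the haversine formula
library(dbscan)      # dbscan() accepts a precomputed dist object

# df: data frame with columns lon, lat, score (hypothetical names); scores assumed > 0
geo <- distm(df[, c("lon", "lat")], fun = distHaversine)   # pairwise distances in metres
sco <- abs(outer(log(df$score), log(df$score), "-"))       # score dissimilarity in log-space

# Binary distance: 0 = within ~1 mile AND scores within a factor of 1.1, else 1
D <- ifelse(geo < 1609 & sco < log(1.1), 0, 1)

# eps = 0.5 separates the 0s from the 1s; with a binary matrix, don't change it
res <- dbscan(as.dist(D), eps = 0.5, minPts = 5)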
It's reasonably fast, but needs O(n^2) memory. In my experience, the indexes of ELKI yield a good speedup if you have larger data, and are worth a try if you run out of memory or time.
You need to scale your data. V3 has a much larger range than V1 and V2, so the distance is dominated by V3 and DBSCAN currently mostly ignores the other two columns.
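A minimal illustration of that fix (the eps value here is only a placeholder and must be re-chosen for the scaled data):
library(fpc)
# Standardise all three columns so no single one dominates the Euclidean distance
df_scaled <- as.data.frame(scale(df))
var <- dbscan(df_scaled, eps = 0.5)   # re-tune eps (e.g. with a k-distance plot) after scaling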

Function and data format for doing vector-based clustering in R

I need to run clustering on the correlations of data row vectors; that is, instead of using the individual variables as clustering predictors, I intend to use the correlations between the variable vectors of the data rows.
Is there a function in R that does this kind of vector-based clustering? If not, and I need to do it manually, what is the right data format to feed into a function such as cmeans or kmeans?
Say I have m variables and n data rows; the m variables constitute one vector for each data row, so I have an n x n matrix of correlations or cosine similarities. Can this matrix be plugged into the clustering function directly, or is some processing required?
Many thanks.
You can transform your correlation matrix into a dissimilarity matrix,
for instance 1-cor(x) (or 2-cor(x) or 1-abs(cor(x))).
# Sample data
n <- 200
k <- 10
x <- matrix( rnorm(n*k), nr=k )
x <- x * row(x) # 10 dimensions, with less information in some of them
# Clustering
library(cluster)
r <- pam(1-cor(x), diss=TRUE, k=5)
# Check the results
plot(prcomp(t(x))$x[,1:2], col=r$clustering, pch=16, cex=3)
R clustering is often a bit limited. This is a design limitation of R, since it relies heavily on low-level C code for performance. The fast k-means implementation included with R is an example of such low-level code, and it in turn is tied to using Euclidean distance.
There are dozens of extensions and alternatives available in the community around R. There are PAM, CLARA and CLARANS, for example. They aren't exactly k-means, but closely related. There should be a "spherical k-means" somewhere that is sensible for cosine distance. There is also the whole family of hierarchical clusterings (which scale rather badly - usually O(n^3), with O(n^2) in a few exceptions - but are very easy to understand conceptually); see the sketch below.
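For instance, the hierarchical route with a correlation-based dissimilarity needs only base R (a sketch reusing x from the example above):
d  <- as.dist(1 - cor(x))            # dissimilarity between the n column vectors
hc <- hclust(d, method = "average")  # "complete" or "ward.D2" are alternatives
clusters <- cutree(hc, k = 5)        # cut the dendrogram into 5 groups
table(clusters)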
If you want to explore some more clustering options, have a look at ELKI; it should allow clustering (with various methods, including k-means) by correlation-based distances (and it also includes such distance functions). It's not R, though, but Java, so if you are bound to using R it won't work for you.

Efficient moving window statistics for matrix and/or spatial data (neighborhood statistics) in R

I'm using the raster and related packages in R to do a bit of remote sensing work. For a number of the functions I'm writing, I'd love to rapidly compute neighborhood / moving window statistics. Unfortunately, any R implementations I or others write are very, very slow.
I know that the caTools package offers this functionality written in C for vectors / time series, which yields a 10X+ time savings. Is anyone familiar with a similar package or function that provides this functionality for matrices and spatial data?
Quick example:
library(raster)
# Generate a raster with random 0/1 values
r <- raster(nrows=100, ncols=100)
values(r) <- rbinom(ncell(r), 1, 0.1)
# Now generate a raster highlighting the original values plus immediate neighbors
# (a 3x3 window of ones gives a queen-style neighborhood)
r.neighbor <- focal(r, w=matrix(1, 3, 3), fun=max)
# system.time() of the above call: about 0.8 seconds for a 100x100 raster on my laptop,
# and over 15 seconds for a 1000x1000 raster
Ideally, I'd like to do this faster and for much larger rasters.
Much thanks,
Nick
P.S. There's some interesting discussion of the massive speed differences across R functions for doing moving window operations on vectors here: http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html
