Efficient moving window statistics for matrix and/or spatial data (neighborhood statistics) in R

I'm using the raster and related packages in R to do a bit of remote sensing work. For a number of the functions I'm writing, I'd love to rapidly compute neighborhood / moving window statistics. Unfortunately, any R implementations I or others write are very, very slow.
I know that the caTools package offers this functionality written in C for vectors / time series, which yields a 10X+ time savings. Is anyone familiar with a similar package or function that provides this functionality for matrices and spatial data?
Quick example:
library(raster)
# Generate a raster with random values
r <- raster(nrows=100, ncols=100)
values(r) <- rbinom(dim(r)[1] * dim(r)[2], 1, 0.1)
# Now generate a raster highlighting the original values plus immediate neighbors
# (A 3x3 window of ones gives a queen-style neighborhood)
r.neighbor <- focal(r, w=matrix(1, nrow=3, ncol=3), fun=max)
# system.time() of the above function for a 100x100 raster takes 0.8 seconds on my laptop
# and takes over 15 seconds for a 1000x1000 raster
Ideally, I'd like to do this faster and for much larger rasters.
Much thanks,
Nick
Ps. There's some interesting discussion of the massive speed differences across R functions for doing moving window operations on vectors here: http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html
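A rough sketch in the direction the question hints at (an illustration added here, not something the caTools documentation shows for matrices): for a separable statistic such as max, the package's fast 1-D running functions can be applied along the rows and then the columns of a plain matrix, giving a 2-D moving-window result without R-level loops. This works for max, min and sums/means, but not for the median.
library(caTools)
# Separable 3x3 moving maximum on a 100x100 binary matrix
m <- matrix(rbinom(100 * 100, 1, 0.1), nrow = 100)
row.max <- t(apply(m, 1, runmax, k = 3))    # running max along each row; transpose restores orientation
win.max <- apply(row.max, 2, runmax, k = 3) # running max along each column; already correctly oriented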

Related

How can I reduce the running time when computing a variogram for a large SpatialPointsDataFrame in R

I am a beginner in R and geostatistics. I have a large SpatialPointsDataFrame (307,907 elements, 27.2 MB). When I use this data to build a variogram, it takes a very long time (R ran for a day without any result). I think this is probably because R normally does single-core processing. Is there a way for R to use multiple cores when building a variogram? Can the foreach function do this?
library(reader)
library(gstat)
library(sp)
coordinates(data_del) <- ~X1+X2   # promote data_del to a SpatialPointsDataFrame
variogram <- variogram(X3~1, data_del, cutoff = 100, width = 10)   # empirical variogram
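A minimal sketch of the foreach route the question asks about, assuming data_del already has its coordinates set as above and that the doParallel package is installed; the subset size of 20,000 points, the four subsets, and the four workers are arbitrary illustrative choices. Empirical variograms are computed on random subsets in parallel, which approximates rather than reproduces the variogram of all 307,907 points:
library(doParallel)   # attaches foreach and parallel as well
cl <- makeCluster(4)
registerDoParallel(cl)
# One empirical variogram per random subset, computed on separate workers
vgm.list <- foreach(i = 1:4, .packages = c("gstat", "sp")) %dopar% {
  sub <- data_del[sample(nrow(data_del), 20000), ]
  variogram(X3 ~ 1, sub, cutoff = 100, width = 10)
}
stopCluster(cl)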

How slow is too slow when kriging with gstat in R

I am trying to use the krige function in the gstat package of R to interpolate some spatial ocean depth data. I am finding that for more than about 1,000 points, the function starts taking an unreasonable amount of time to finish (hours to days, or it never finishes at all). Is this normal or am I doing something wrong? I am particularly concerned because my eventual goal is to do spatio-temporal kriging of a very large dataset (>30,000 data points), and I am worried that it just won't be feasible given these run times.
I am running gstat-1.1-3 and R-3.3.2. Below is the code I am running:
library(sp); library(raster); library(gstat)
v.utm # SpatialPointsDataFrame with >30,000 points
# Remove points with identical positions
zd = zerodist(v.utm)
nzd = v.utm[-zd[,1],] # Layer with no identical positions
# Make a raster layer covering point layer
resolution=1e4
e = extent(as.matrix(v.utm@coords))+resolution
r = raster(e,resolution=resolution)
proj4string(r) = proj4string(v.utm)
# r is a 181x157 raster
# Fit variogram
depth.fit = fit.variogram(variogram(AVGDEPTH~1, nzd), model=vgm(6000,"Exp",1,5e5,1))
# Krige on random sample of 500 points - works fine
size=500
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)
# Krige on random sample of 5000 points - never seems to end
size=5000
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)
The complexity of the Cholesky decomposition (or similar) is O(n^3), meaning that if you multiply the number of points by 10, the time it takes increases by a factor of 1000. There are two ways out of this problem, at least as far as gstat is concerned:
Install an optimized BLAS (e.g. OpenBLAS or MKL); this does not solve the O(n^3) problem, but may give a speed-up of at most a factor of n, where n is the number of cores available.
Avoid decomposing the full covariance matrix by choosing local neighbourhoods (arguments maxdist and/or nmax); see the sketch below.
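A minimal sketch of the second option, reusing the objects from the question; the value nmax = 30 is an arbitrary illustrative choice:
# Local kriging: each prediction location uses only its nmax nearest
# observations, so the full covariance matrix is never decomposed.
depth.krig.local = krige(AVGDEPTH~1, nzd, as(r,"SpatialPixelsDataFrame"),
                         model=depth.fit, nmax=30)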
A much faster alternative to kriging for large datasets is griddify in the marmap package. It took me a while to find this, but it works well. It uses bilinear interpolation and although it is designed for bathymetric maps, it works with any xyz data.
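A hedged sketch of that approach, reusing nzd from the question; the 157 x 181 grid mirrors the raster r above, and the way the xyz table is built here is only illustrative:
library(marmap)
# Assemble a plain three-column xyz table (x, y, depth) from the points
xyz <- data.frame(coordinates(nzd), depth = nzd$AVGDEPTH)
depth.grid <- griddify(xyz, nlon = 157, nlat = 181)   # returns a RasterLayer
plot(depth.grid)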

R dynamic time warping for long time series

I'm trying to calculate the dtw distance for very long time series, but I get an error showing that R cannot allocate memory for the matrix.
Here is what I do:
library(dtw)
set.seed(1234)
N <- 300000
x <- rnorm(N)
y <- rnorm(N)
dtw(x,y,distance.only=TRUE)$distance
Error: cannot allocate vector of size 670.6 Gb
Is there an alternative way to calculate the dtw distance that does not need to allocate so much memory?
I don't know this package well, but from the companion paper of the package you have:
Larger problems may be addressed by approximate strategies, e.g.,
computing a preliminary alignment between downsampled time series
(Salvador and Chan 2004); indexing (Keogh and Ratanamahatana 2005); or
breaking one of the sequences into chunks and then iterating
subsequence matches.
The latter option can be implemented with something like:
lapply(split(y, rep(1:100, each = length(y)/100)),  ## split y into 100 contiguous chunks
       function(z) dtw(x, z, distance.only=TRUE)$distance)
PS: By "larger" the paper means problems that exceed 8000 × 8000 points (close to the virtual memory limit), which is the case here.
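The first strategy from the quote (a preliminary alignment between downsampled series) can be sketched like this, reusing x and y from the question; the factor of 100 is an arbitrary choice:
# Keep every 100th point, so the alignment matrix is 3000 x 3000
# instead of 300000 x 300000, then run dtw on the shorter series.
x.ds <- x[seq(1, length(x), by = 100)]
y.ds <- y[seq(1, length(y), by = 100)]
dtw(x.ds, y.ds, distance.only = TRUE)$distance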

calculate mean for each cell in grid in R [duplicate]

Does somebody know whether there is a sliding window method in R for 2D matrices, not just vectors? I need to apply the median function to an image stored in a matrix.
The function focal() in the excellent raster package is good for this. It takes several arguments beyond those shown in the example below, and can be used to specify a non-rectangular sliding window if that's needed.
library(raster)
## Create some example data
m <- matrix(1, ncol=10, nrow=10)
diag(m) <- 2
r <- as(m, "RasterLayer") # Coerce matrix to RasterLayer object
## Apply a function that returns a single value when passed values of cells
## in a 3-by-3 window surrounding each focal cell
rmean <- focal(r, w=matrix(1/9, ncol=3, nrow=3))             # weighted sum with a 1/9 kernel = 3x3 mean
rmedian <- focal(r, w=matrix(1, ncol=3, nrow=3), fun=median) # ones-only window, so fun sees the raw cell values
## Plot the results to confirm that this behaves as you'd expect
par(mfcol=c(1,3))
plot(r)
plot(rmean)
plot(rmedian)
## Coerce results back to a matrix, if you so desire
mmean <- as(rmean, "matrix")
I know this is an old question, but I have come across it many times when looking to solve a similar problem. While the focal function in the raster package IS very straightforward and convenient, I have found it to be very slow when working with large rasters. There are many ways to try to address this, but one way I found is to use system commands to call "whitebox tools", a command-line driven set of raster analysis tools. Its main advantage is that it executes the tools in parallel and really takes advantage of multi-core CPUs. I know R has many cluster functions and packages (which I use for random forest model raster prediction), but I have struggled with much of the parallel computing implementation in R. Whitebox tools has discrete functions for mean, max, majority, median, etc. filters (not to mention loads of terrain processing tools, which is great for DEM-centric analyses).
Some example code for how I implemented a modal (majority) filter with a 3x3 window in R on a large classified land cover raster (nrow=3793, ncol=6789, ncell=25750677) using whitebox tools:
system('C:/WBT2/target/release/whitebox_tools --wd="D:/Temp" ^
--run=MajorityFilter -v --input="input_rast.tif" ^
--output="maj_filt_rast.tif" --filterx=3 --filtery=3',
wait = T, timeout=0, show.output.on.console = T)
The above code took less than 3.5 seconds to execute, while the equivalent focal() call using modal from the raster package took about 5 minutes to complete:
maj_filt_rast <- focal(input_rast, fun=modal, w=matrix(1, nrow=3, ncol=3))
Getting whitebox tools compiled and installed IS a bit annoying, but good instructions are provided. In my opinion it is well worth the effort: it makes raster processes that were previously prohibitively slow in R run amazingly fast, and it lets me keep all the coding inside R via system commands.

Function and data format for doing vector-based clustering in R

I need to run clustering on the correlations of data row vectors; that is, instead of using the individual variables as clustering predictors, I intend to use the correlations between the row vectors of variables.
Is there a function in R that does this kind of vector-based clustering? If not, and I need to do it manually, what is the right data format to feed into a function such as cmeans or kmeans?
Say I have m variables and n data rows; the m variables constitute one vector per data row, so I end up with an n x n matrix of correlations (or cosine similarities). Can this matrix be plugged into a clustering function directly, or is some processing required first?
Many thanks.
You can transform your correlation matrix into a dissimilarity matrix,
for instance 1-cor(x) (or 2-cor(x) or 1-abs(cor(x))).
# Sample data
n <- 200
k <- 10
x <- matrix( rnorm(n*k), nr=k )
x <- x * row(x) # 10 dimensions, with less information in some of them
# Clustering
library(cluster)
r <- pam(1-cor(x), diss=TRUE, k=5)
# Check the results
plot(prcomp(t(x))$x[,1:2], col=r$clustering, pch=16, cex=3)
R's built-in clustering is often a bit limited. This is a design consequence of R relying heavily on low-level C code for performance: the fast kmeans implementation included with R is an example of such low-level code, and it is in turn tied to Euclidean distance.
There are dozens of extensions and alternatives available in the community around R: PAM, CLARA and CLARANS, for example. They aren't exactly k-means, but they are closely related. There should also be a "spherical k-means" somewhere, which is sensible for cosine distance. And there is the whole family of hierarchical clusterings, which scale rather badly (usually O(n^3), with O(n^2) in a few exceptions) but are very easy to understand conceptually.
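For the spherical k-means mentioned above, one candidate (my suggestion, not something the answer names) is the skmeans package on CRAN, which clusters the rows of a matrix by cosine dissimilarity:
library(skmeans)
# Cluster the 200 observations from the example above (the rows of t(x))
# into 5 groups using spherical k-means (cosine dissimilarity).
sk <- skmeans(t(x), k = 5)
table(sk$cluster)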
If you want to explore more clustering options, have a look at ELKI; it allows clustering (with various methods, including k-means) by correlation-based distances, and it includes such distance functions. It's not R, though, but Java, so if you are bound to using R it won't work for you.
