How to create a large distance matrix in R?

How can a huge distance matrix be allocated in a sensible way, avoiding the "cannot allocate vector" error? Imagine you have 100,000 points randomly spread over some
space. How can one cleverly create a matrix or "dist" object that represents
one half of the distance matrix? Maybe it should be some other kind of object that can hold this large number of distances efficiently.
You can get the polygon object from the following link:
https://www.dropbox.com/sh/65c3rke0gi4d8pb/LAKJWhwm-l
# Load required packages
library(sp)
library(maptools)
library(maps)
# Load the polygonal object
x <- readShapePoly("vg250_gem.shp")
# Sample a large number of points inside the polygons
# (this command takes a few minutes to run)
# "coord" is a SpatialPoints object
n <- 1e5
coord <- spsample(x, n, "random")
# Try to measure the distances by dist()
DistMatrix <- dist(coord@coords)
Error: negative length vectors are not allowed
# Try to measure the distances by spDists()
DistMatrix <- spDists(coord)
Error: cannot allocate vector of size (some number) MB
# It seems that the problem lies in the sheer size of the matrix to be created.
How can this problem be solved in R for large values of n?

At this point R cannot allocate the requested number of megabytes of RAM: your computer is already using its memory elsewhere, and there simply isn't (some number) of MB available for your process to continue. You have several options, among them: get a machine with more RAM, close other programs, or do your distance calculations in smaller batches. Try a smaller n, and when it works, just repeat the process several times until you have your whole matrix of distances. A rough sketch of that batch approach is shown below.
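This sketch (an illustration under assumptions, not tested on the full data) computes the distances in row blocks of 1,000 points against all points and processes or stores each block before moving on; the block size and the commented-out write-to-disk step are placeholders to adapt to your RAM and your actual goal.
block <- 1000
idx   <- split(seq_len(n), ceiling(seq_len(n) / block))
for (i in seq_along(idx)) {
  d_block <- spDists(coord[idx[[i]], ], coord)   # a 1,000 x 100,000 block of distances
  # keep only what you need from d_block, or append it to a file, e.g.:
  # write.table(d_block, "distances.txt", append = (i > 1),
  #             row.names = FALSE, col.names = FALSE)
}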

Related

Memory issue with K-means clustering

I'm trying to cluster key phrases from a search history using k-means clustering, but I run into the error "cannot allocate vector of size 30 Gb" when I run the stringdistmatrix() command. The dataset I am using contains 63,455 unique elements, so the resulting matrix requires about 30 GB of memory to process. Is there a way to lower the memory requirements of the process without losing too much significance?
Below is the code I am attempting to run, if you happen to notice any other errors:
#Load required package for stringdistmatrix()
library(stringdist)
#Set data source, format for use, check consistency
MyData <- c('Create company email', 'email for business', 'free trial', 'corporate pricing', 'email cost')
summary(MyData)
#Define number of clusters
kclusters = round(0.90 * length(unique(MyData)))
#Compute distance between words
uniquedata <- unique(as.character(MyData))
distancemodels <- stringdistmatrix(uniquedata, uniquedata, method="jw")
#Create Dendrogram
rownames(distancemodels) <- uniquedata
hc <- hclust(as.dist(distancemodels))
par(mar = rep(2, 4))
plot(hc)
#Create clusters from grouped keywords
dfClust <- data.frame(uniquedata, cutree(hc, k=kclusters))
names(dfClust) <- c('data','cluster')
plot(table(dfClust$cluster))
#End view
View(dfClust)
I don't know of any way to avoid generating the distance matrix when doing k-means clustering.
You could consider alternative clustering algorithms that have been devised to avoid memory issues. The main one that comes to mind is CLARA (Clustering Large Applications; Kaufman and Rousseeuw 1990). In R, it's as simple as cluster::clara, taking numeric data only (like k-means) and requiring you to set k in advance.
Read the manual (?cluster::clara), especially the section on the number of samples, which you should set higher than the default. Hope that helps!
Edit: I just noticed you don't actually have numeric data to start with, so perhaps CLARA is not all that helpful. You could still use some of the same principles as CLARA, namely sampling your data multiple times to reduce the memory footprint and combining the results later on. A minimal sketch of a CLARA call is shown below for reference.
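This is a minimal sketch for purely numeric data (the stand-in matrix, k, samples and sampsize values are all made up for illustration); clara() clusters repeated subsamples and never builds the full distance matrix.
library(cluster)
set.seed(1)
m <- matrix(rnorm(63455 * 5), ncol = 5)                 # stand-in numeric data
cl <- clara(m, k = 50, samples = 100, sampsize = 500)   # 'samples' raised well above the default of 5
table(cl$clustering)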

How to chunk large dissimilarity / distance matrices in R?

I would like to cluster mixed-type data that contains 50k rows and 10 features/columns. I am using R on my 64-bit PC. When I calculate the dissimilarity / distance matrix with the "daisy" function, I get an "Error: cannot allocate vector of size X Gb" error.
gower_dist <- daisy(df, metric = "gower")
This is the command that generates the distance matrix. How can this script be handled in chunks to avoid the RAM error?
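One possible workaround, sketched here under assumptions (the subsample size and k are placeholders, and df is the 50k-row data frame from the question): cluster a manageable subsample with daisy() + pam(), then assign every remaining row to its nearest medoid via a tiny per-row Gower computation. Note that the per-row Gower scaling is only approximate, since it is derived from the small frame rather than from the full data.
library(cluster)
set.seed(1)
sub  <- df[sample(nrow(df), 5000), ]         # subsample that fits in RAM
d    <- daisy(sub, metric = "gower")
fit  <- pam(d, k = 10)                        # k is a placeholder
meds <- sub[fit$id.med, ]                     # the k medoid rows
nearest_medoid <- function(row) {
  dd <- as.matrix(daisy(rbind(row, meds), metric = "gower"))
  unname(which.min(dd[1, -1]))                # index of the closest medoid
}
clusters <- vapply(seq_len(nrow(df)),
                   function(i) nearest_medoid(df[i, , drop = FALSE]),
                   integer(1))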

How slow is too slow when kriging with gstat in R

I am trying to use the krige function in the gstat package of R to interpolate some spatial ocean depth data. I am finding that for more than about 1,000 points, the function starts taking an unreasonable amount of time to finish (i.e., hours to days, or it never finishes). Is this normal, or am I doing something wrong? I am particularly concerned because my eventual goal is to do spatio-temporal kriging of a very large dataset (>30,000 data points), and I am worried that it just won't be feasible given these run times.
I am running gstat-1.1-3 and R-3.3.2. Below is the code I am running:
library(sp); library(raster); library(gstat)
v.utm # SpatialPointsDataFrame with >30,000 points
# Remove points with identical positons
zd = zerodist(v.utm)
nzd = v.utm[-zd[,1],] # Layer with no identical positions
# Make a raster layer covering point layer
resolution=1e4
e = extent(as.matrix(v.utm@coords))+resolution
r = raster(e,resolution=resolution)
proj4string(r) = proj4string(v.utm)
# r is a 181x157 raster
# Fit variogram
depth.fit = fit.variogram(variogram(AVGDEPTH~1, nzd),model=vgm(6000,"Exp",1,5e5,1))
# Krige on random sample of 500 points - works fine
size=500
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)
# Krige on random sample of 5000 points - never seems to end
size=5000
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)
The complexity of the Cholesky decomposition (or similar) is O(n^3), meaning that if you multiply the number of points by 10, the time it takes increases by a factor of 1000. There are two ways around this problem, at least as far as gstat is concerned:
Install an optimized BLAS (e.g. OpenBLAS or MKL). This does not remove the O(n^3) cost, but may speed things up by at most a factor of n, with n the number of cores available.
Avoid decomposing the full covariance matrix by choosing local neighbourhoods (arguments maxdist and/or nmax); see the sketch below.
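For the second option, the call could look something like this (a sketch only; the nmax and maxdist values are placeholders to tune for your data and map units):
depth.krig = krige(AVGDEPTH~1, ss, as(r,"SpatialPixelsDataFrame"),
                   model=depth.fit,
                   nmax=50,       # use at most the 50 nearest observations per prediction
                   maxdist=2e5)   # ignore observations beyond 2e5 map units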
A much faster alternative to kriging for large datasets is griddify in the marmap package. It took me a while to find this, but it works well. It uses bilinear interpolation and, although it is designed for bathymetric maps, it works with any xyz data.
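For reference, a hedged sketch of that alternative (the column construction and grid size below are assumptions for illustration): griddify takes a three-column x/y/z data frame plus the target grid dimensions and returns a raster built by bilinear interpolation.
library(marmap)
xyz <- data.frame(x = coordinates(nzd)[, 1],
                  y = coordinates(nzd)[, 2],
                  z = nzd$AVGDEPTH)
depth.grid <- griddify(xyz, nlon = 157, nlat = 181)   # same grid size as the raster above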

R dynamic time warping for long time series

I'm trying to calculate the DTW distance for very long time series, but I get an error showing that I cannot allocate memory for the matrix.
Here is what I do:
library(dtw)
set.seed(1234)
N <- 300000
x <- rnorm(N)
y <- rnorm(N)
dtw(x,y,distance.only=TRUE)$distance
Error: cannot allocate vector of size 670.6 Gb
Is there an alternative way to calculate the dtw distance that does not need to allocate so much memory?
I don't know this package well, but from the companion paper of the package you have:
Larger problems may be addressed by approximate strategies, e.g.,
computing a preliminary alignment between downsampled time series
(Salvador and Chan 2004); indexing (Keogh and Ratanamahatana 2005); or
breaking one of the sequences into chunks and then iterating
subsequence matches.
The latter option can be implemented by something like:
chunks <- split(y, rep(1:100, each = length(y) / 100))  # split y into 100 contiguous chunks
lapply(chunks, function(z) dtw(x, z, distance.only = TRUE)$distance)
PS: by "larger" here, the authors mean problems that exceed roughly 8000 × 8000 points (close to the virtual memory limit), which is the case here.
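The first strategy from the quote (a preliminary alignment on downsampled series) is also easy to try; the downsampling factor of 100 below is an arbitrary choice:
x_ds <- x[seq(1, length(x), by = 100)]   # 3,000 points instead of 300,000
y_ds <- y[seq(1, length(y), by = 100)]
dtw(x_ds, y_ds, distance.only = TRUE)$distance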

Efficient moving window statistics for matrix and/or spatial data (neighborhood statistics) in R

I'm using the raster and related packages in R to do a bit of remote sensing work. For a number of the functions I'm writing, I'd love to rapidly compute neighborhood / moving window statistics. Unfortunately, any R implementations I or others write are very, very slow.
I know that the caTools package offers this functionality written in C for vectors / time series, which yields a 10X+ time savings. Is anyone familiar with a similar package or function that provides this functionality for matrices and spatial data?
Quick example:
# Load required package
library(raster)
# Generate a raster with random values
r <- raster(nrows=100, ncols=100)
values(r) <- rbinom(dim(r)[1] * dim(r)[2], 1, 0.1)
# Now generate a raster highlighting the original values plus immediate neighbors
# (a 3x3 window of ones gives a queen-style neighborhood)
r.neighbor <- focal(r, w=matrix(1, 3, 3), fun=max)
# system.time() of the above function for a 100x100 raster takes 0.8 seconds on my laptop
# and takes over 15 seconds for a 1000x1000 raster
Ideally, I'd like to do this faster and for much larger rasters.
Much thanks,
Nick
Ps. There's some interesting discussion of the massive speed differences across R functions for doing moving window operations on vectors here: http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html
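One idea, sketched below under stated assumptions (caTools::runmax is the C-backed vector routine mentioned in the question; edge handling will differ slightly from focal()): because the maximum is separable, a 3x3 neighborhood max can be computed as a running max along each row followed by a running max along each column.
library(raster)
library(caTools)
m <- as.matrix(r)                              # raster values as a matrix
row_max <- t(apply(m, 1, runmax, k = 3))       # running max along each row
nb_max  <- apply(row_max, 2, runmax, k = 3)    # then along each column
r.fast <- r
values(r.fast) <- as.vector(t(nb_max))         # back into a raster with r's geometry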
