R dynamic time warping for long time series - r

I'm trying to calculate the dtw distance for very long time series but I get an error that shows I cannot allocate memory for the matrix.
Here what I do:
library(dtw)
set.seed(1234)
N <- 300000
x <- rnorm(N)
y <- rnorm(N)
dtw(x,y,distance.only=TRUE)$distance
Error: cannot allocate vector of size 670.6 Gb
Is there an alternative way to calculate the dtw distance that does not need to allocate so much memory?

Idon't know this package , but From the companion paper of the package you have:
Larger problems may be addressed by approximate strategies, e.g.,
computing a preliminary alignment between downsampled time series
(Salvador and Chan 2004); indexing (Keogh and Ratanamahatana 2005); or
breaking one of the sequences into chunks and then iterating
subsequence matches.
The latter option can be implemented by something like :
lapply(split(y,1:100), ## I split y in 100 chnucks
function(z)dtw(x,z,distance.only=TRUE)$distance)
PS: By larger here , it means problems that exceed 8000 × 8000 points (close to the virtual memory limit) which it is your case here.

Related

Competing risk survival random forest with large data

I have a data set with 500,000 observations with events and a competing risk as well as a time-to-event variable (survival analysis).
I want to run a survival random forest.
The R-package randomForestSRC is great for it, however, it is impossible to use more than 100,000 rows due to memory limitation (100'000 uses 40GB of RAM) even though I limit my number of predictors to 15 to 20.
I have a hard time finding a solution. Does anyone have a recommendation?
I looked at h2o and spark mllib, both of which do not support survival random forests.
Ideally I am looking for a somewhat R-based solution but I am happy to explore anything else if anyone knows a way to use large data + competing risk random forest.
In general, the memory profile for an RF-SRC data set is n x p x 8 on a 64-bit machine. With n=500,000 and p=20, RAM usage is approximately 80MB. This is not large.
You also need to consider the size of the forest, $nativeArray. With the default nodesize = 3, you will have n / 3 = 166,667 terminal nodes. Assuming symmetric trees for convenience, the total number of interanal/external nodes will approximately be 2 * n / 3 = 333,333. With the default ntree = 1000, and assuming no factors, $nativeArray will be of dimensions [2 * n / 3 * ntree] x [5]. A simple example will show you why we have [5] columns in the $nativeArray to tag the split parameter, and split value. Memory usage for the forest will be thus be 2 * n / 3 * ntree * 5 * 8 = 1.67GB.
So now we are getting into some serious memory usage.
Next consider the ensembles. You haven't said how many events you have in your competing risk data set, but let's assume there are two for simplicity.
The big arrays here are the cause-specific hazard function (CSH) and the cause-specific cumulative incidence function (CIF). These are both of dimension [n] x [time.interest] x [2]. In a worst case scenario, if all your times are distinct, and there are no censored events, time.interest = n. So each of these outputs is n * n * 2 * 8 bytes. This will blow up most machines. It's time.interest that is your enemy. In big-n situations, you need to constrain the time.interest vector to a subset of the actual event times. This can be controlled with the parameter ntime.
From the documentation:
ntime: Integer value used for survival families to constrain ensemble calculations to a grid of time values of no more than ntime time points. Alternatively if a vector of values of length greater than one is supplied, it is assumed these are the time points to be used to constrain the calculations (note that the constrained time points used will be the observed event times closest to the user supplied time points). If no value is specified, the default action is to use all observed event times.
My suggestion would be to start with a very small value of ntime, just to test whether the data set can be analyzed in its entirety without issue. Then increase it gradually and observe your RAM usage. Note that if you have missing data, then RAM usage will be much larger. Also note that I did not mention other arrays such as the terminal node statistics that also lead to heavy RAM usage. I only considered the ensembles, but the reality is that each terminal node will contain arrays of dimension [time.interest] x 2 for the node specific estimator of the CSH and CIF that is used in creating the forest ensemble.
In the future, we will be implementing a Big Data option that will suppress ensembles and optimize the memory profile of the package to better accommodate big-n scenarios. In the meantime, you will have to be diligent in using the existing options like ntree, nodesize, and ntime to reduce your RAM usage.

How slow is too slow when kriging with gstat in R

I am trying to use the krige function in the gstat package of R to interpolate some spatial ocean depth data in R. I am finding for more than about ~1000 points, the function starts taking unreasonable amounts of time to finish (i.e., hours to days to hasn't ever finished). Is this normal or am I doing something wrong? I am particularly concerned because my eventual goal is to do spatio-temporal kriging of a very large dataset (>30,000 data points) and I am worried that it just won't be feasible given these run times.
I am running gstat-1.1-3 and R-3.3.2. Below is the code I am running:
library(sp); library(raster); library(gstat)
v.utm # SpatialPointsDataFrame with >30,000 points
# Remove points with identical positons
zd = zerodist(v.utm)
nzd = v.utm[-zd[,1],] # Layer with no identical positions
# Make a raster layer covering point layer
resolution=1e4
e = extent(as.matrix(v.utm#coords))+resolution
r = raster(e,resolution=resolution)
proj4string(r) = proj4string(v.utm)
# r is a 181x157 raster
# Fit variogram
fv = fit.variogram(variogram(AVGDEPTH~1, nzd),model=vgm(6000,"Exp",1,5e5,1))
# Krige on random sample of 500 points - works fine
size=500
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)
# Krige on random sample of 5000 points - never seems to end
size=5000
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)
The complexity of the choleski decomposition (or similar) is O(n^3), meaning that if you multiply the number of points by 10, the time it will take increases with a factor 1000. There are two ways out of this problem, at least for as far as gstat is concerned:
install an optimized version of BLAS (eg OpenBLAS, or MKL) - this does not solve the O(n^3) problem, but may speed up maximally a factor n with n the number of cores available
Avoid decomposing the full covariance matrix by choosing local neighbourhoods (arguments maxdist and/or nmax)
A much faster alternative to kriging for large datasets is griddify in the marmap package. It took me a while to find this, but it works well. It uses bilinear interpolation and although it is designed for bathymetric maps, it works with any xyz data.

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
My vectors are composed by one email and by 30 qualitative variables.
Each quantitative variable has 4 classes: 0,1,2 and 3.
So first thing I'm doing is to load the library FactoMineR and to load my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running a MCA in this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using well the hcpc function?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a real number, hence consol is set to FALSE here. The object res.consul is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot show the clusters based on the 300 individuals (kk = Inf) and based on the 30 k means centres (kk = 30) for the data plotted on the first two MCA axes:
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k means centres, perhaps up to 6000 with 8GB RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
That error message usually indicates that R has not enough RAM at its disposal to complete the command. I guess you are running this within 32bit R, possibly under Windows? If this is the case, then killing other processes and deleting unused R variables might possibly help: for example, you might try to delete mydata, mydata2 with
rm(mydata, mydata2)
(as well as all other non-necessary R variables) before executing the command which generates the error. However the ultimate solution in general is to switch to 64bit R, preferably under 64bit Linux and with a decent RAM amount, also see here:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html

How to create a Large Distance Matrix?

How to allocate a huge distance matrix in an appropriate way to avoid "allocation is
unable" error. Imagine you have a 100.000 points randomly spreaded over some
space. How can one cleverly create a matrix or "dist"-object, which represents the
the half of DistMatrix. Maybe it should be another object, which will be able efficiently allocate the large number of distances.
You can get the polygonial object from the following link:
https://www.dropbox.com/sh/65c3rke0gi4d8pb/LAKJWhwm-l
# Load required packages
library(sp)
library(maptools)
library(maps)
# Load the polygonal object
x <- readShapePoly("vg250_gem.shp")
# Sample or Pick up a large Number of Points
# this command needs some minutes to be done.
# "coord" is SpatialPoints Object
n <- 1e5
coord <- spsample(x, n, "random")
# Try to measure the distances by dist()
DistMatrix <- dist(coord#coords)
Error: negative length vectors are not allowed
# Try to measure the distances by spDists()
DistMatrix <- spDists(coord)
Error: cannot allocate vector of size (some number) MB
# It seems that the problem lies on large matrix to be created.
How is this problem solvable in R for great numbers of "n".
At this point R cannot allocate the random number of megabytes of RAM. At this point, your computer is using all of its memory somewhere else and there just isn't (some number) of MBytes available for your process to continue. You have several solutions at this point. Among them, get a machine with more RAM, close programs, or do your distance calculations in smaller batches. Try a smaller n; and when it works just repeat the process several times until you have your whole matrix of distances.

Efficient moving window statistics for matrix and/or spatial data (neighborhood statistics) in R

I'm using the raster and related packages in R to do a bit of remote sensing work. For a number of the functions I'm writing, I'd love to rapidly compute neighborhood / moving window statistics. Unfortunately, any R implementations I or others write are very, very slow.
I know that the caTools package offers this functionality written in C for vectors / time series, which yields a 10X+ time savings. Is anyone familiar with a similar package or function that provides this functionality for matrices and spatial data?
Quick example:
# Generate a raster with random values
r <- raster(nrows=100, ncols=100)
values(r) <- rbinom(dim(r)[1] * dim(r)[2], 1, 0.1)
# Now generate a raster highlighting the original values plus immediate neighbors
# (By default ngb yields a queen-esque weighting system)
r.neighbor <- focal(r, ngb=3, fun=max)
# system.time() of the above function for a 100x100 raster takes 0.8 seconds on my laptop
# and takes over 15 seconds for a 1000x1000 raster
Ideally, I'd like to do this faster and for much larger rasters.
Much thanks,
Nick
Ps. There's some interesting discussion of the massive speed differences across R functions for doing moving window operations on vectors here: http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html

Resources