dist function with large number of points - r

I am using the dist {stats} function to calculate the distance between points. My problem is that I have 24469 points, and the output of the dist function gives me a vector of length 18705786 instead of a matrix. I already tried exporting it with as.matrix, but the file is too large.
How can I find out which pair of points each distance corresponds to?
For example, which(distance <= 700) gives me positions in the vector, but how can I map those positions back to the points they correspond to?
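For reference, one way to map positions in the dist vector back to the pair of point indices they represent (a minimal sketch; it assumes the column-wise lower-triangle layout that dist uses and 1-based indices):
# Sketch: recover the point pair (i, j) behind position k of a dist vector
# computed on n points (dist stores the lower triangle column by column)
pair_from_dist_index <- function(k, n) {
  i <- ceiling(((2 * n - 1) - sqrt((2 * n - 1)^2 - 8 * k)) / 2)
  j <- k - (i - 1) * (2 * n - i) / 2 + i
  cbind(i = i, j = j)
}

# Usage with the example from the question:
# distance <- dist(points)
# hits <- which(distance <= 700)
# pair_from_dist_index(hits, n = attr(distance, "Size"))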

There are some things you could try, depending on what exactly you need:
Calculate the distances in a loop, and only keep those that match the criterion. Especially when the number of matches is much smaller than the total size of the distance matrix, this saves a lot of RAM. Such a loop will probably be very slow if implemented in pure R; that is also why dist does not use R but, I believe, C to perform the calculations. This could mean that you get your results but have to wait a while. Alternatively, the excellent Rcpp package would allow you to write this in C/C++, most likely making it much, much faster. A rough chunked sketch of this idea follows after these suggestions.
Start using packages like bigmemory to store the distance matrix. You then build it in a loop and store it iteratively in the bigmemory object (I have not worked with bigmemory before, so I don't know the exact details). After building the matrix, you can access it to extract your desired results. Effectively, all tricks for handling large data in R apply here. See e.g. the R SO posts on big data.
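Here is a rough sketch of the first suggestion, done in chunks rather than row by row so it stays reasonably fast in plain R (coords and the 700-unit threshold are illustrative, following the question):
# Squared-distance expansion |a - b|^2 = |a|^2 + |b|^2 - 2 * a.b, block vs. all points
cross_dist <- function(a, b) {
  sq <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  sqrt(pmax(sq, 0))
}

find_close_pairs <- function(coords, threshold = 700, block_size = 1000) {
  n <- nrow(coords)
  hits <- list()
  for (start in seq(1, n, by = block_size)) {
    idx <- start:min(start + block_size - 1, n)
    d <- cross_dist(coords[idx, , drop = FALSE], coords)
    w <- which(d <= threshold, arr.ind = TRUE)
    i <- idx[w[, 1]]
    j <- w[, 2]
    keep <- i < j  # keep each pair once and drop self-distances
    hits[[length(hits) + 1]] <- data.frame(i = i[keep], j = j[keep], dist = d[w][keep])
  }
  do.call(rbind, hits)
}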
Some interesting links (found googling for r distance matrix for large vector):
Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
(lucky you!) http://stevemosher.wordpress.com/2012/04/08/using-bigmemory-for-a-distance-matrix/

Related

Is it possible in R to calculate all eigenvalues of a very large symmetric n by n dense matrix in blocks to conserve RAM?

To provide some context, I work with DNA methylation data that even after some filtering can still consist of 200K-300K features (with far fewer samples, about 500). I need to do some operations on this, and I have been using the bigstatsr package for other operations, which can use a Filebacked Big Matrix (FBM) to compute, for instance, a crossproduct in blocks. I further found that this can work with RSpectra::eigs_sym to get a specified number of eigenvalues, but unfortunately not all of them. To get all eigenvalues I have mainly seen the base R eigen function being used, but with this I run out of RAM when I have a matrix that is 300k by 300k.
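For context, here is a hedged sketch of that kind of setup (object names and sizes are illustrative, and it only yields the top k eigenvalues, not all of them):
library(bigstatsr)
library(RSpectra)

# File-backed matrix, samples x features (small sizes here for illustration)
X <- as_FBM(matrix(rnorm(500 * 5000), nrow = 500))

# Matrix-free product for K = crossprod(X): K %*% v = t(X) %*% (X %*% v),
# so the features x features matrix is never materialised
Kv <- function(v, args) big_cprodVec(args$X, big_prodVec(args$X, v))

top <- eigs_sym(Kv, k = 50, n = ncol(X), args = list(X = X))
top$values  # the 50 largest eigenvalues only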

Is exactextract taking shortcuts and sacrificing accuracy when calculating zonal statistics in R?

I've been calculating zonal statistics in R, first using the raster::extract function and then using the exactextractr::exact_extract function. I compared the results of both to the results of zonal statistics I calculated by hand in QGIS. The results from the raster extract function match the QGIS results to several decimal places, whereas the exact extract function provides results that are close but a bit off:
QGIS Results    Raster Extract    Exact Extract
44.08599        44.08599          44.23548
56.82178        56.82178          56.90371
41.57019        41.57019          41.69187
55.97451        55.97451          56.02464
The pro of using exact_extract is that it is MUCH, MUCH faster than the raster extract function - but at what cost? Is the increased speed a result of "cutting corners" and less accurate results? And if so, exactly how much accuracy am I losing? I'm trying to determine whether the time saved is worth it if I end up with worse results.
exactextractr is faster and more precise because it literally cuts corners: it considers the fractions of raster cells covered by a polygon, not only whole cells. terra::extract is also reasonably fast.
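For anyone reproducing the comparison, a small hedged sketch (the raster and polygon files are hypothetical placeholders):
library(terra)
library(sf)
library(exactextractr)

r     <- rast("elevation.tif")   # hypothetical raster
zones <- st_read("zones.gpkg")   # hypothetical polygons

# Coverage-weighted mean: partial cells contribute in proportion to the area covered
m_exact <- exact_extract(r, zones, "mean")

# Mean over whole cells (each cell counts fully or not at all by default)
m_whole <- terra::extract(r, vect(zones), fun = mean)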

Difference between distm function or the distVincentyEllipsoid in R

Could you fully explain the big difference between using the distm function and the distVincentyEllipsoid function to calculate distances between geodesic coordinates in R?
I noticed that using distm for this calculation takes much longer. Beyond the difference between the two, could you please explain why this happens?
Thank you!
Following on from your previous question here: Distance calculation optimization in R
The speed relates to the amount of computation required to produce the returned object, not necessarily to a difference in how the distances themselves are computed (I am not sure what great circle computation the distm() function uses as its default). Indeed, the geosphere documentation here: https://cran.r-project.org/web/packages/geosphere/geosphere.pdf suggests that the distVincentyEllipsoid() calculation is "very accurate" but "computationally more intensive" than other great circle methods. While that would lead you to expect a slower computation, my answer is fast because of the way I structured the code to return a vector of distances between each row (not a matrix of distances between each and every point).
Conversely, the distm() call in your original code returns a matrix of distances between each and every point. For your problem this is not necessary so long as the data is ordered, which is why I structured it that way. Additionally, using hierarchical clustering to group the points into 3 (your defined number of) clusters based on these distances is also not necessary, as we can use percentiles of the distances between consecutive points to do the same. Again, the speed benefit comes from computing the clusters on a single vector rather than a matrix.
Please note, I am a data analyst with a background in accounting/finance and not a GIS specialist by any means. That being said, my use of the distVincentyEllipsoid() function comes from my general understanding that it returns a pretty accurate estimation of great circle distances as a vector (as opposed to a matrix). Moreover, having used it in the past to optimise logistics operations for pricing purposes, I can attest that these computations have been tested in the market and found to be sound.
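To illustrate the vector-versus-matrix point, a small sketch (the data frame df and its lon/lat columns are assumptions, not taken from the original code):
library(geosphere)

p1 <- cbind(df$lon, df$lat)            # points in longitude/latitude order
p2 <- cbind(df$lon_next, df$lat_next)  # e.g. the next point along an ordered route

# n x n matrix: distance between every pair of points (slower, O(n^2) memory)
all_pairs <- distm(p1)

# length-n vector: one ellipsoidal distance per row of p1/p2
row_wise <- distVincentyEllipsoid(p1, p2)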

R how to create a large matrix by combining small blocks of matrix

I'm working on a constrained optimization problem using the Lagrange multiplier method, and I'm trying to build a huge sparse matrix in R in order to calculate the values.
Here's what the matrices would look like; see the link below for the details of the problem if needed.
Implementation of Lagrange Multiplier to solve constrained optimization problem.
Here's the code I've come up with; sorry if my approach seems clumsy to you, since I'm new to matrix manipulation and programming.
First, I imported the 3154 by 30 matrix from a csv file, and then combined all its columns into one. Then I created a diagonal matrix to imitate the upper left corner of the big matrix.
Then, to imitate the lower left corner of the matrix, I created a 3154x3154 identity matrix and tried to replicate it 30 times.
I have two questions here:
When I tried to cbind the diagonal sparse matrices, it returned a combination of two lists instead of a matrix, so I had to convert it to a dense matrix, but this takes too much memory. I'd like to know if there's a better way to accomplish this.
I want to know if there's a way to cbind a matrix multiple times, since I need to replicate the matrix 30 times. I'm curious if there's a cleaner way to get around all the typing. (This was solved thanks to @Jthorpe.)
I was going to do the same thing for the rest of the matrices. I know this is not the best approach to tackle this problem, so please feel free to suggest any smarter way of doing it. Thanks!
library(Matrix)
dist_data <- read.csv("/Users/xxxxx/dist_mat.csv", header = TRUE)
c <- ncol(dist_data)  # number of clusters - 30
n <- nrow(dist_data)  # number of observations - 3153
# Create a c*n + c + n = 30*3153 + 30 + 3153 = 97,773 dimensional coefficient matrix
dist_list <- unlist(dist_data)                             # stack all columns into one vector
Coeff_mat <- 2 * .sparseDiagonal(c * n, x = c(dist_list))  # upper-left diagonal block, kept sparse
diag <- .sparseDiagonal(n)                                 # n x n sparse identity
Uin  <- do.call(cbind, rep(list(as.matrix(diag)), 30))     # as.matrix() goes dense here -- this is the memory problem
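A hedged sketch of one way to keep the replicated-identity block sparse instead of converting to a dense matrix (block name is illustrative; this relies on the Matrix package's sparse cbind methods):
library(Matrix)

# c copies of the n x n sparse identity, column-bound without ever going dense;
# cbind() dispatches to the Matrix package's sparse methods, so the result stays sparse
lower_left <- do.call(cbind, rep(list(.sparseDiagonal(n)), c))

dim(lower_left)    # n x (c*n)
class(lower_left)  # a sparse Matrix class, not a dense base matrix

# The remaining blocks can then be assembled with cbind()/rbind(), or bdiag()
# for block-diagonal pieces, all of which preserve sparsity.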

dist() function in R: vector size limitation

I was trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a big table with 500k rows and 41 columns (the 1st one is the name), and when I tried
d<-dist(as.matrix(file),method="euclidean")
I got this error
Error: cannot allocate vector of size 1101.1 Gb
How can I get around this limitation? I googled it and came across the ff package in R, but I don't quite understand whether that could solve my issue.
Thanks!
Generally speaking, hierarchical clustering is not the best approach for dealing with very large datasets.
In your case, however, there is a different problem. If you want to cluster samples, the structure of your data is wrong. Observations should be represented as rows, and gene expression (or whatever kind of data you have) as columns.
Let's assume you have data like this:
data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))
What you want to do is:
# Create transposed data matrix
data.matrix.t <- t(as.matrix(data))
# Create distance matrix
dists <- dist(data.matrix.t)
# Clustering
hcl <- hclust(dists)
# Plot
plot(hcl)
NOTE
You should remember that Euclidean distances can be rather misleading when you work with high-dimensional data.
When dealing with large data sets, R is not the best choice.
The majority of methods in R seem to be implemented by computing a full distance matrix, which inherently needs O(n^2) memory and runtime. Matrix-based implementations don't scale well to large data, unless the matrix is sparse (which a distance matrix per definition isn't).
I don't know if you realized that 1101.1 Gb is over a terabyte. I don't think you have that much RAM, and you probably won't have the time to wait for computing this matrix either.
ELKI, for example, is much more powerful for clustering, as you can enable index structures to accelerate many algorithms. This saves both memory (usually down to linear memory usage, for storing the cluster assignments) and runtime (usually down to O(n log n), one O(log n) operation per object).
But of course, it also varies from algorithm to algorithm. K-means, for example, which needs only point-to-mean distances, does not need (and cannot use) an O(n^2) distance matrix.
So in the end: I don't think R's memory limit is your actual problem. The method you want to use doesn't scale.
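As a small illustration of that last point about k-means (sizes arbitrary, not from the original answer):
# k-means runs directly on the data matrix and never forms an n x n distance matrix
set.seed(1)
x  <- matrix(rnorm(500000 * 40), ncol = 40)  # 500k observations, 40 features
km <- kmeans(x, centers = 10, iter.max = 50)
table(km$cluster)                            # cluster sizes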
I just experienced a related issue, but with fewer rows (around 100 thousand, for 16 columns).
RAM size is the limiting factor.
To limit the memory needed, I used 2 different functions from 2 different packages.
From parallelDist, the function parDist() allows you to obtain the distances quite fast. It of course uses RAM during the process, but it seems that the resulting dist object takes less memory (no idea why).
Then I used the hclust() function, but from the package fastcluster. fastcluster is actually not that fast on such an amount of data, but it seems that it uses less memory than the default hclust().
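A hedged sketch of that combination (the matrix here is a small random stand-in, purely for illustration):
library(parallelDist)
library(fastcluster)

m <- matrix(rnorm(10000 * 16), ncol = 16)  # scaled-down stand-in for the real data

d   <- parDist(m, method = "euclidean")    # multi-threaded distance computation
hcl <- fastcluster::hclust(d, method = "complete")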
Hope this will be useful for anybody who finds this topic.
