Difference between the distm function and the distVincentyEllipsoid function in R

Could you fully explain the difference between using the distm function and the distVincentyEllipsoid function to calculate the distance between geodesic coordinates in R?
I noticed that distm takes much longer for this calculation. Beyond the difference between the functions, could you please explain why this happens?
Thank you!

Following on from your previous question here: Distance calculation optimization in R
The speed difference relates to the amount of computation required to produce the returned object, not necessarily to a difference between the distance computations themselves (I am not sure which great circle computation the distm() function uses as its default). Indeed, the geosphere:: documentation here: https://cran.r-project.org/web/packages/geosphere/geosphere.pdf suggests that the distVincentyEllipsoid() calculation is "very accurate" but "computationally more intensive" than other great circle methods. While that would make you suspect a slower computation, it is faster here because of the way I structured the code in my answer: it returns a vector of distances between consecutive rows, not a matrix of distances between each and every point.
Conversely, the distm() call in your original code returns a matrix of distances between each and every point. For your problem this is not necessary, so long as the data is ordered, which is why I structured the code the way I did. Additionally, using hierarchical clustering to group the points into 3 (your defined number of) clusters based on these distances is also not necessary, as we can use percentiles of the distances between consecutive points to do the same. Again, the speed benefit comes from computing the clusters on a single vector rather than on a matrix.
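For illustration, here is a minimal sketch of the two call patterns (the pts coordinates are made up; per the geosphere reference manual, distm()'s fun argument defaults to distHaversine):

library(geosphere)

# Four made-up lon/lat points, ordered as rows
pts <- cbind(lon = c(4.90, 4.91, 4.95, 5.00),
             lat = c(52.37, 52.38, 52.40, 52.41))

# distm(): a full 4 x 4 matrix of great circle distances between every
# pair of points
all_pairs <- distm(pts)

# distVincentyEllipsoid(): a vector of 3 distances, one per consecutive
# pair of rows - much less work when only row-to-row distances are needed
row_to_row <- distVincentyEllipsoid(pts[-nrow(pts), ], pts[-1, ])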
Please note, I am a data analyst with a background in accounting/finance and not a GIS specialist by any means. That being said, my use of the distVincentyEllipsoid() function comes from my general understanding that it returns a pretty accurate estimate of great circle distances as a vector (as opposed to a matrix). Moreover, having used it in the past to optimise logistics operations for pricing purposes, I can attest that these computations have been tested in the market and found to be sound.

Related

Different bandwidth specification in mean-shift clustering with different packages in R

I want to perform mean-shift clustering in R and found out that there are at least two packages with this functionality: MeanShift and meanShiftR. As shown here, the latter is much faster; since I tried the first one and it took a long time to perform a clustering, I'm keen on choosing meanShiftR. However, the meanShiftR::meanShift function has a rather uncommon way of specifying the bandwidth; see this part of the documentation:
queryData: A matrix or vector of points to be classified by the mean shift algorithm. Values must be finite and non-missing.
bandwidth: A vector of length equal to the number of columns in the queryData matrix, or length one when queryData is a vector. This value will be used in the kernel density estimate for steepest ascent classification. The default is one for each dimension.
I'm not an expert in mean-shift clustering, but the only bandwidth specifications I have found in the literature are a scalar or a positive-definite, symmetric matrix, not a vector. So is the vector just a technical trick for representing the bandwidth, with the value required to be the same for each dimension? Or can it vary?
The other issue is that even when setting the same bandwidth value in the meanShiftR package as in MeanShift::msClustering, just replicated to match the number of columns, I obtained totally different results, in particular a much larger number of clusters. Also, the modes were all very similar to each other and not representative of the dataset. That made me wonder whether this package works correctly. Has anyone used meanShiftR? If so, could you present an example, as the documentation is not clear enough for me?
This isn't actually different. It is one scalar per dimension (column) of the query data, i.e. a diagonal bandwidth matrix, which is a standard special case of the positive-definite matrix form you know from the literature.
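A minimal sketch of how that reads in practice (the iris data and the bandwidth value are arbitrary choices, and the $assignment/$value element names reflect my reading of the meanShiftR documentation, so treat the details as assumptions):

library(meanShiftR)

X <- as.matrix(iris[, 1:4])        # 4-dimensional numeric data
h <- 0.8                            # one scalar bandwidth ...
fit <- meanShift(queryData = X,
                 trainData = X,
                 bandwidth = rep(h, ncol(X)),  # ... replicated per dimension
                 iterations = 100)

table(fit$assignment)               # cluster label for each query point
head(fit$value)                     # the mode each point converged to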

8 point algorithm for estimating Fundamental Matrix

I'm watching a lecture about estimating the fundamental matrix for use in stereo vision using the 8 point algorithm. I understand that once we recover the fundamental matrix between two cameras we can compute the epipolar line on one camera given a point on the other. To my understanding this epipolar line (after it's been rectified) makes it easy to find feature correspondences, because we are simply matching features along a 1D line.
The confusion comes from the fact that the 8-point algorithm itself requires at least 8 feature correspondences to estimate the fundamental matrix.
So, we are finding point correspondences to recover a matrix that is used to find point correspondences?
This seems like a chicken-egg paradox so I guess I'm misunderstanding something.
The fundamental matrix can be precomputed. This leads to two advantages:
You can use a nice environment in which features can be matched easily (like using a chessboard) to compute the fundamental matrix.
You can use more computationally expensive operations like a sequence of SIFT, FLANN and RANSAC across the entire image since you only need to do that once.
Once you have the fundamental matrix, you can find correspondences in a noisy environment more efficiently than with the method you used to compute the fundamental matrix in the first place.
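To make the estimation step itself concrete, here is a bare-bones sketch of the linear part of the 8-point algorithm in base R (my own toy code; the Hartley normalisation you would want in practice is omitted):

# p1, p2: n x 2 matrices (n >= 8) of matched pixel coordinates in images 1 and 2
estimate_fundamental <- function(p1, p2) {
  stopifnot(nrow(p1) >= 8, nrow(p1) == nrow(p2))
  x1 <- p1[, 1]; y1 <- p1[, 2]
  x2 <- p2[, 1]; y2 <- p2[, 2]
  # Each correspondence gives one linear equation in the 9 entries of F,
  # from the epipolar constraint x2' F x1 = 0
  A <- cbind(x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2, x1, y1, 1)
  # f is the eigenvector of A'A with the smallest eigenvalue: the least
  # squares solution of A f = 0, up to scale
  f <- eigen(crossprod(A))$vectors[, 9]
  Fmat <- matrix(f, nrow = 3, byrow = TRUE)
  # Enforce rank 2, which a valid fundamental matrix must have
  s <- svd(Fmat)
  s$u %*% diag(c(s$d[1:2], 0)) %*% t(s$v)
}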

Fast way of doing k-means clustering on binary vectors in C++

I want to cluster binary vectors (millions of them) into k clusters. I am using Hamming distance to find the nearest neighbours of the initial cluster centres (which is very slow as well). I think k-means clustering does not really fit here. The problem is in calculating the mean of the nearest neighbours (which are binary vectors) of an initial cluster centre in order to update the centroid.
A second option is to use k-medoids, in which the new cluster centre is chosen from among the nearest neighbours (the one closest to all the neighbours of a particular cluster centre). But finding that is another problem, because the number of nearest neighbours is also quite large.
Can someone please guide me?
It is possible to do k-means clustering with binary feature vectors. The TopSig paper I co-authored has the details. The centroids are calculated by taking the most frequently occurring bit in each dimension. The TopSig paper applied this to document clustering, where we had binary feature vectors created by random projection of sparse, high dimensional bag-of-words feature vectors. There is an implementation in Java at http://ktree.sf.net. We are currently working on a C++ version, but it is very early code which is still messy and probably contains bugs; you can find it at http://github.com/cmdevries/LMW-tree. If you have any questions, please feel free to contact me at chris#de-vries.id.au.
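A tiny sketch of the centroid rule described above (my own toy code, not taken from TopSig or LMW-tree): the "mean" of a set of binary vectors is just the majority bit per dimension, and assignment uses Hamming distance.

# Majority vote per dimension: the bit that occurs most often becomes the
# centroid's bit (ties broken towards 1 here)
majority_centroid <- function(members) {
  # members: a matrix of 0/1 rows belonging to one cluster
  as.integer(colMeans(members) >= 0.5)
}

# Hamming distance between a binary vector and a centroid
hamming <- function(x, centroid) sum(x != centroid)

# Toy usage: five 16-bit vectors assigned to one cluster
set.seed(1)
members <- matrix(rbinom(5 * 16, 1, 0.3), nrow = 5)
centroid <- majority_centroid(members)
hamming(members[1, ], centroid)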
If you want to cluster a lot of binary vectors, there are also more scalable tree-based clustering algorithms such as K-tree, TSVQ and EM-tree. For more details on these algorithms, see a paper relating to the EM-tree that I have recently submitted for peer review and that is not yet published.
Indeed k-means is not too appropriate here, because the means won't be reasonable on binary data.
Why do you need exactly k clusters? This will likely mean that some vectors won't fit to their clusters very well.
Some things you could look into for clustering: MinHash and locality-sensitive hashing.

dist function with large number of points

I am using the dist {stats} function to calculate the distance between points. My problem is that I have 24469 points, and the dist function gives me a vector of length 18705786 instead of a matrix. I already tried to convert it with as.matrix, but the resulting file is too large.
How can I find out which pair of points each distance corresponds to?
For example, which(distance <= 700) gives me positions in the vector, but how can I get the information about which points each of these distances corresponds to?
There are some things you could try, depending on what exactly you need:
Calculate the distances in a loop, and only keep those that match the criterion (a sketch of this follows below). Especially when the number of matches is much smaller than the total size of the distance matrix, this saves a lot of RAM. Such a loop is probably very slow if implemented in pure R; that is also why dist does not use R but, I believe, C to perform the calculations. This could mean that you get your results but have to wait a while. Alternatively, the excellent Rcpp package would allow you to write this in C/C++, probably making it much, much faster.
Start using packages like bigmemory to store the distance matrix. You then build it in a loop and store it iteratively in the bigmemory object (I have not worked with bigmemory before, so I don't know the exact details). After building the matrix, you can access it to extract your desired results. Effectively, all the tricks for handling large data in R apply here. See e.g. the R SO posts on big data.
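A minimal sketch of the first suggestion (the coordinates are placeholders, the 700 threshold is borrowed from your question, and Euclidean distance is assumed, as with dist's default):

pts <- matrix(rnorm(24469 * 2), ncol = 2)   # placeholder for your coordinates
threshold <- 700

close_pairs <- list()
for (i in seq_len(nrow(pts) - 1)) {
  # Euclidean distances from point i to all later points (one point at a
  # time, so the full distance matrix is never built)
  d <- sqrt(colSums((t(pts[(i + 1):nrow(pts), , drop = FALSE]) - pts[i, ])^2))
  hits <- which(d <= threshold)
  if (length(hits) > 0) {
    close_pairs[[length(close_pairs) + 1]] <-
      data.frame(i = i, j = i + hits, distance = d[hits])
  }
}
result <- do.call(rbind, close_pairs)  # one row per pair within the threshold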
Some interesting links (found by googling for "r distance matrix for large vector"):
Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
(lucky you!) http://stevemosher.wordpress.com/2012/04/08/using-bigmemory-for-a-distance-matrix/

What is the meaning of "Inf" in S_Dbw output in R commander?

I have run the clv package, which provides the S_Dbw and SD validity indexes for clustering, in R Commander. (http://cran.r-project.org/web/packages/clv/index.html)
I evaluated my clustering results from the DBSCAN, k-means and Kohonen algorithms with the S_Dbw index, but for all three algorithms S_Dbw is "Inf".
Does it mean "infinite"? Why did I get "Inf"? Is there a problem with my clustering results?
In general, when is the S_Dbw index "Inf"?
Be careful when comparing different algorithms with such an index.
The reason is that the index is pretty much an algorithm in itself. One particular clustering will necessarily be the "best" for each index. The main difference between an index and an actual clustering algorithm is that the index doesn't tell you how to find the "best" solution.
Some examples: k-means minimizes the distances from cluster members to cluster centers. Single-link hierarchical clustering will find the partition with the optimal minimum distance between partitions. Well, DBSCAN will find the partitioning of the dataset, where all density-connected points are in the same partition. As such, DBSCAN is optimal - if you use the appropriate measure.
Seriously: do not assume that because one algorithm scores higher than another on a particular measure, it works better. All you find out this way is that a particular algorithm is more (cor)related to a particular measure. Think of it as a kind of correlation between the measure and the algorithm, on a conceptual level.
Using a measure to compare different results of the same algorithm is a different matter. Then, obviously, there shouldn't be a benefit of one algorithm over itself. There may still be a similar effect with respect to parameters, though: for example, the in-cluster distances in k-means obviously go down when you increase k.
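A quick toy illustration of that last point (my own example):

set.seed(42)
X <- matrix(rnorm(200 * 2), ncol = 2)
# Total within-cluster sum of squares shrinks as k grows, regardless of
# whether the extra clusters are meaningful
sapply(1:6, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)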
In fact, many of the measures are not even well-defined on DBSCAN results, because DBSCAN has the concept of noise points, which the indexes do not handle, AFAIK.
Do not assume that the measure will give you an indication of what is "true" or "correct", and even less of what is useful or new. You should use cluster analysis not to find a mathematical optimum of a particular measure, but to learn something new and useful about your data - and that is probably not some measure's number.
Back to the indices: they are usually designed entirely around k-means. From a short look at S_Dbw, I have the impression that the moment one "cluster" consists of a single object (e.g. a noise object in DBSCAN), the value becomes infinity - i.e. undefined. It seems as if the authors of that index did not consider this corner case, but only used it on toy data sets where such situations did not arise. The R implementation can't fix this without deviating from the original index and turning it into yet another index. Handling noise objects and singletons is far from trivial. I have not yet seen an index that doesn't fail in one way or another - typically, a solution such as "all objects are noise" will either score perfectly, or every clustering can trivially be improved by assigning each noise object to the nearest non-singleton cluster. If you want your algorithm to be able to say "this object doesn't belong to any cluster", then I do not know of an appropriate index.
The IEEE floating point standard defines Inf and -Inf as positive and negative infinity, respectively. It means your result was too large to represent in the given number of bits (an overflow), or came from an operation such as dividing a non-zero number by zero.
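You can reproduce and test for this in R directly (a trivial illustration, not tied to the clv internals):

.Machine$double.xmax * 2          # overflow past the largest representable double: Inf
1 / 0                             # division by zero also yields Inf
is.finite(c(1, Inf, -Inf, NaN))   # FALSE for Inf, -Inf and NaN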
