Saving and incrementally updating nearest-neighbor model in R - r

There are several nearest neighbor R packages (e.g., FNN, RANN, yaImpute) but none of them seem to allow saving off the NN data structure (cover tree, KD tree etc.) so that the nearest neighbors of new queries can be calculated without reconstructing the whole tree. Are there any such functions in R?
I am looking for a function that returns a data structure that I can update incrementally as new data arrives to perform approximate K nearest neighbor search.

There is a good reason why no NN package does that.
The "NN data structure" necessarily includes all the input data points (in the form of a KD tree), so there is no space saving over the input data. It appears that there would be time savings in not having to re-create the KD-tree for each new input, but this is not the case, alas.
Building a KD-tree takes at least linearithmic time, O(n log n), and longer with a naive build. This means that, for large inputs, it makes sense to sort the data before building the KD-tree, because that produces the tree faster and better balanced, which improves the search too (search is also worse than logarithmic, in general). This approach speeds up building and querying but discourages incremental updates, of course, since points inserted afterwards unbalance the tree.
Your best bet, I think, is to find a generic KD-tree package and use it instead.

The nabor package lets you build a tree and subsequently perform queries on it. But I don't think it lets you update the tree incrementally.
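Since the stored structure is essentially the data itself, one fallback is to persist the point matrix and append to it as data arrives. The following is only a minimal base-R sketch of that workflow: `knn_brute` is a hypothetical helper, the data are synthetic, and the search is brute force rather than a KD-tree, so it is exact but slow for large inputs.

```r
# Minimal sketch: the "model" is just the point matrix, so it can be saved,
# reloaded and appended to. Search is brute force (no KD-tree), for illustration.
knn_brute <- function(points, query, k) {
  # squared Euclidean distance from the single query point to every stored point
  diffs <- points - matrix(query, nrow(points), ncol(points), byrow = TRUE)
  d2 <- rowSums(diffs^2)
  idx <- order(d2)[seq_len(k)]
  list(nn.idx = idx, nn.dists = sqrt(d2[idx]))
}

set.seed(42)
pts <- matrix(runif(200), ncol = 2)          # 100 stored points in 2-D

path <- tempfile(fileext = ".rds")
saveRDS(pts, path)                           # persist the "model"
pts <- readRDS(path)                         # ... and load it again later

new_pts <- rbind(c(0.5, 0.5), c(0.9, 0.1))   # newly arrived data
pts <- rbind(pts, new_pts)                   # incremental update: append rows

res <- knn_brute(pts, query = c(0.5, 0.5), k = 3)
res$nn.idx[1]                                # 101: the appended point itself
```

For large data the brute-force search would be swapped for a tree-based package, rebuilding the tree after a batch of appends.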

Related

Is calculating the index in an array more efficient than letting the compiler do it?

I'm trying to generalize a neural network function to arbitrarily many layers, and so I need multiple matrices to hold the weights for each neuron in each layer. I was originally explicitly declaring matrix objects in R to hold my weights for each layer. Instead of having one matrix per layer, I thought of a way (not saying it's original), to store all of my weights in a single array and defined an "indexing function" to map a weight to its appropriate index in the array.
I defined the function as follows:
where is the k-th weight of the j-th neuron in the i-th layer and L(r) is the number of neurons in layer r. After writing these definitions, I realize that stackoverflow doesn't allow latex like mathoverflow which is unfortunate.
Now the question is: Is it more efficient to compute the index of my weights in this way, or is actually less efficient?
After looking up how indices are computed for arrays in general, this is essentially what is done on compilation anyway if I just kept a matrix in each layer holding the weights, so it seems like I may just be making my code overly complicated and harder to understand if there's no difference in time efficiency.
TL;DR: use the matrices; it's easier to understand and takes advantage of optimized CPU instructions.
In computer science parlance, the efficiency (scalability) of algorithms is reasoned about using Big O cost. A score can be given to both the time and space complexity.
Using Big O notation, let's compare the two approaches:
Array Approach
time complexity:
Array index access is O(1) time: no matter how large an array becomes, it is just as computationally easy to access an element given its index.
As you've created a function to compute the index of the k-th weight, this adds some small complexity but would probably run in constant O(1) time as it is a mathematical expression, so negligible.
space complexity:
O(N) where N is the number of weights across all layers.
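To make the constant-time claim concrete, here is a sketch of what such an indexing function could look like. The question's own formula is not shown, so the layout is an assumption: the weights feeding each layer form one contiguous block, stored neuron by neuron. `sizes`, `block_start` and `weight_index` are hypothetical names.

```r
# Hypothetical layout (the question's own formula is not shown):
# sizes[r] is L(r), the number of neurons in layer r; layer 1 is the input.
sizes <- c(3, 4, 2)

# Number of weights feeding each layer r >= 2, and where each block starts.
block_len <- sizes[-1] * sizes[-length(sizes)]      # 12, 8
block_start <- c(0, cumsum(block_len))              # 0, 12, 20

# 1-based position of the k-th weight of the j-th neuron in layer i (i >= 2).
# Only a few arithmetic operations, so the cost does not grow with the network.
weight_index <- function(i, j, k) {
  block_start[i - 1] + (j - 1) * sizes[i - 1] + k
}

weights <- numeric(sum(block_len))                  # one flat vector, 20 slots
weights[weight_index(3, 2, 4)] <- 0.7               # layer 3, neuron 2, weight 4

weight_index(2, 1, 1)   # 1
weight_index(3, 1, 1)   # 13
weight_index(3, 2, 4)   # 20
```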
Matrices Approach
time complexity:
A matrix is essentially a 2d array with O(1) access
space complexity
O(N + M), where N is number of neurons and M is number of weights.
Conceptually, we can see that the two approaches have an equivalent time and space complexity score.
However, there are other trade-offs involved (and as a good SO-er I must inform you of those).
When it comes to working with the data in the array vs matrices approach, the array approach is less efficient as it circumvents the opportunity for SIMD operations. As @liborm alluded to, there are vectorised (SIMD) operations handled by lower-level system libraries like LAPACK/BLAS, which "batch" CPU instructions for some matrix operations (less overhead to transfer and compute data at the CPU compared with sending a new instruction every time).
Instead of having one matrix per layer, I thought of a way ... to store all of my weights in a single array
It's hard to see why you would opt for the latter, as it requires you to create a bespoke indexing function. Maybe it's nicer to think about all your weights being in one place, in one long array? However, I would argue the mental load required to maintain the array mapping is higher than having multiple matrices, each dedicated to a layer.
A hash-table-like structure of matrices (in R, a named list) would be much easier to reason about:
layers <- list(layer1 = [[...]], layer2 = [[...]], layerN = [[...]])
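A runnable version of that idea might look like the sketch below. The layer sizes, the random weights and the sigmoid activation are all assumptions for illustration, and bias terms are left out to keep it short.

```r
# Sketch: one weight matrix per layer, stored in a named list.
set.seed(1)
sizes <- c(3, 4, 2)   # hypothetical layer sizes: 3 inputs, 4 hidden, 2 outputs

layers <- list(
  layer1 = matrix(rnorm(sizes[2] * sizes[1]), nrow = sizes[2]),  # 4 x 3
  layer2 = matrix(rnorm(sizes[3] * sizes[2]), nrow = sizes[3])   # 2 x 4
)

sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass: each step is a single matrix product, handed off to BLAS.
forward <- function(layers, x) {
  a <- x
  for (W in layers) a <- sigmoid(W %*% a)
  a
}

out <- forward(layers, c(0.5, -1, 2))
dim(out)   # 2 1
```

Adding a layer is then just another entry in the list; no index arithmetic has to change.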
Further reading
http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
There are many factors to take into consideration in each of the approaches. I'm not familiar with R but I'm assuming matrices' buffers are represented as one-dimensional arrays in memory. (Even if they are written as two dimensional arrays in the underlying C implementation the compiler stores it as one-dimensional array in memory)
The overall outline of the memory operations is:
Case: Several matrices, one per layer
Allocation of matrices:
Accessing of indices:
Case: One matrix for all layers + index calculation
Allocation of matrix cost:
Accessing each of the indices cost:
Function cost:
We can clearly see that the second case scales better, even though there's the additional cost of the function call.
Having said that, in general, having a statically allocated array with all the weights for all the layers should be faster.
In most cases, a computer's bottleneck is memory bandwidth, and the best way to counteract this is to minimize the number of memory accesses.
With this in mind, there's another, more primitive reason why the 2nd approach will probably be faster: caches.
Here's a good explanation of the performance difference in accessing a two-dimensional array in a loop by Good Ol' Bob Martin
TL;DR: Caches take advantage of the principle of locality, and therefore having memory accesses spatially close to each other (as you would in one single array, accessing them in a cache-friendly way as explained in Bob Martin's answer) gives better performance than having them spatially separated (in several distinct arrays).
PS: I also recommend benchmarking both approaches and comparing, since these nuances regarding the cache are machine-dependent. It might be the case that the dataset/NN is small enough to fit completely in RAM, or even in cache on a very powerful server.
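One way to run such a benchmark in R is sketched here. The layer size, the repetition count and the function names `by_matrix` and `by_index` are arbitrary assumptions; the sketch only checks that both approaches give the same result and reports the timings, which will differ from machine to machine.

```r
set.seed(1)
n_in <- 200
n_out <- 200
W <- matrix(rnorm(n_out * n_in), nrow = n_out)   # per-layer matrix
w_flat <- as.vector(t(W))                        # same weights, flat, neuron by neuron
x <- rnorm(n_in)

# Approach 1: native matrix product
by_matrix <- function() as.vector(W %*% x)

# Approach 2: hand-computed index into the flat vector
by_index <- function() {
  out <- numeric(n_out)
  for (j in seq_len(n_out)) {
    for (k in seq_len(n_in)) {
      out[j] <- out[j] + w_flat[(j - 1) * n_in + k] * x[k]
    }
  }
  out
}

# Both approaches must agree before their timings are worth comparing.
stopifnot(isTRUE(all.equal(by_matrix(), by_index())))

t_matrix <- system.time(for (r in 1:20) by_matrix())[["elapsed"]]
t_index <- system.time(for (r in 1:20) by_index())[["elapsed"]]
c(matrix = t_matrix, index = t_index)            # seconds for 20 repetitions
```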
I'm sure you want to use some kind of native array objects, so you get the speedups provided by BLAS/LAPACK implementations (see eg Intel MKL discussion here if you're on Windows). Most of the time in NN evaluation will be spent in matrix multiplications (like SGEMM), and this is where BLAS implementations like Intel MKL can be an order of magnitude faster.
That is - even if the hand-coded indices for your single-array multi-layer network were super fast, you won't be able to use it with the optimised multiplication routines, which would make your whole network significantly slower. Use the native array objects and create a multi layer abstraction on top of them.
But actually, if you want speed and usability (and to really build some NN models), you should consider using something like the R interface to TensorFlow. As a bonus you'll get things like running on the GPU for free.
Nice puzzle. If you are asking about calculating the index yourself, that would happen at runtime, whereas anything the compiler does has to be settled at compile time. I just want to understand: how would you let the compiler compute it? If you need to work with the information later on, I would suggest a HashMap kind of mechanism. I did that for a similar need.

Feature selection on subsets of feature set

I am trying to do feature selection using the Boruta package in R. The problem is that my feature set is way too large (70518 features), so the data frame is too large (2 GB) to be processed by the Boruta package at once. I am wondering if I can split the data frame into several sets, each containing a smaller number of features? This sounds a bit weird to me, as I am not sure if the algorithm can correctly identify the weights if not all features are present.
If not, I would be very grateful if someone can suggest an alternative way of doing it.
I think your best bet in this case might be to first try to filter out some of the features that are either low information (e.g. ~zero variance) or highly correlated.
The caret package has some useful functions to help with this.
For example, findCorrelation() can be used to easily remove redundant features:
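library(caret)
# keep the correlation matrix separate so that dat itself is not overwritten
cor_mat <- cor(dat, method='spearman')
cor_mat[is.na(cor_mat)] <- 0
features_to_ignore <- findCorrelation(cor_mat, cutoff=0.75, verbose=FALSE)
# guard: a negative index of length zero would drop every column
if (length(features_to_ignore) > 0) dat <- dat[,-features_to_ignore]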
This removes features until no remaining pair has an absolute Spearman correlation above 0.75.
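The low-information filter mentioned at the start can be sketched in base R as well. The data frame here is synthetic and the 1e-8 variance threshold is an arbitrary assumption; caret's nearZeroVar() is a packaged alternative with more careful criteria.

```r
# Sketch: drop (near-)constant columns before any correlation filtering.
set.seed(1)
dat <- data.frame(
  a = rnorm(50),
  b = rep(1, 50),            # constant: carries no information
  c = rnorm(50),
  d = c(rep(0, 49), 1e-6)    # almost constant
)

col_var <- vapply(dat, var, numeric(1))     # variance of every column
keep <- col_var > 1e-8                      # assumed threshold
dat_filtered <- dat[, keep, drop = FALSE]
names(dat_filtered)   # "a" "c"
```

Because this pass looks at one column at a time, it can be run on chunks of columns when the full data frame does not fit comfortably in memory.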
I'm going to start by asking why you believe that this can even work? In this case, not only is p >> n, but p >>>>>> n. You're always going to find spurious associations. More than that, even if you could do this (say by renting a sufficiently large machine in a cloud computing service, which is the method I'd suggest), you're looking at an absurd amount of computation, since the computational complexity of building a single decision tree is O(n * v log(v)), where n is the number of records and v is the number of fields in each record. Building an RF takes that much for each tree.
Instead of solving the problem as stated, you might want to rethink it from the ground up. What are you really trying to do here? Can you go back to first principles and rethink that?

Graph partitioning optimization

The problem
I have a set of locations on a plane (actually they are pins in a KML file) and I want to partition this graph into subgraphs. Connectivity is pretty good - as with all real world road networks - so I assume that if two locations are close they have some kind of connection. The resulting set of subgraphs should adhere to these constraints:
Every node has to be covered by a subgraph
Every node should be in exactly 1 subgraph
All nodes within a subgraph should be close to each other (L2 norm distances)
Every subgraph should contain at least 5 locations
The number of subgraphs should be minimal
Right now the number of locations is no more than 100, so I thought about brute forcing through every possibility, but this obviously won't scale well.
I thought about using some k-Nearest-Neighbors algorithm (e.g. using QuickGraph) but I can't get my head around where to start and how to extend/shrink the subgraphs on the way. Maybe it's possible to map this problem to another problem that can easily be solved with some numerical procedure (e.g. Simplex) ...
Maybe someone has experience in this kind of optimization problems and is willing to help me find a solution? I don't have access to Mathematica/Matlab or the like ... but sufficient .NET programming skills and hmm Excel :-)
Thanks a lot!
As soon as there are multiple criteria that need to be satisfied in the best possible way simultaneously, things usually start to get difficult.
A numerical solution could work as follows: you could define a utility function that maps partitionings of your locations to positive real values, describing how "good" a partition is by assigning it a "rating" ("good" could be high, "bad" could be near zero).
Once you have such a function assigning partitions their according "values", you simply need to optimize it and then you hopefully obtain a good solution if you defined your utility function reasonably. Evolutionary algorithms are good at that task since your utility function is probably analytically too complex to solve due to its discrete nature.
The problem is then only how you assign "values" to partitions via this utility function. This is then your task. It can be done for example by weighing each criterion with a factor and summing the results up, or even more complex functions (least squares etc.). The factors you use in the definition of the utility function are tuning parameters and can be varied until the result seems to be good.
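As an illustration of such a weighted sum, here is a sketch of one possible utility function for the constraints in the question. Everything in it is an assumption: the three criteria, the weights `w_spread`, `w_count` and `w_small`, and the mapping of the total cost to a value in (0, 1].

```r
# Hypothetical utility: higher is better. The w_* weights are tuning parameters.
partition_utility <- function(xy, labels, min_size = 5,
                              w_spread = 1, w_count = 1, w_small = 10) {
  groups <- split(seq_len(nrow(xy)), labels)
  # Criterion 1: mean pairwise L2 distance inside each subgraph
  spread <- sum(vapply(groups, function(idx) {
    if (length(idx) < 2) return(0)
    mean(dist(xy[idx, , drop = FALSE]))
  }, numeric(1)))
  # Criterion 2: number of subgraphs (fewer is better)
  count <- length(groups)
  # Criterion 3: number of subgraphs below the minimum size
  too_small <- sum(lengths(groups) < min_size)
  cost <- w_spread * spread + w_count * count + w_small * too_small
  1 / (1 + cost)   # positive value, near zero for bad partitions
}

set.seed(7)
xy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
            matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))   # two clumps of 10
good <- rep(1:2, each = 10)          # respects the clumps
bad  <- rep(1:2, times = 10)         # mixes the clumps
partition_utility(xy, good) > partition_utility(xy, bad)   # TRUE
```

An optimizer (evolutionary or otherwise) would then search over `labels` for the partition with the highest utility.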
Some CA software would help a lot for testing if you can get your hands on one, but I guess to obtain a black box solver for your partitioning problem, you need to implement the complete procedure yourself using a language of your choice.

rapid exploring random trees

http://msl.cs.uiuc.edu/rrt/
Can anyone explain how rrt works with simple wording that is easy to understand?
I read the description in the site and in wikipedia.
What I would like to see, is a short implementation of a rrt or a thorough explanation of the following thing:
Why does the rrt grow outwards instead of just growing very dense around the center?
How is it different from a naive random tree?
How is the next new vertex that we attempt to reach picked?
I know there is a Motion Strategy Library I could download, but I would much rather understand the idea before I delve into the code rather than the other way around.
The simplest possible RRT algorithm has been so successful because it is pretty easy to implement. Things tend to get complicated when you:
need to visualise planning concepts in more than two dimensions;
are unfamiliar with the terminology associated with planning; and
have to navigate the huge number of variants of RRT that have been described in the literature.
Pseudo code
The basic algorithm looks something like this:
1. Start with an empty search tree
2. Add your initial location (configuration) to the search tree
3. While your search tree has not reached the goal (and you haven't run out of time):
3.1. Pick a location (configuration), q_r, (with some sampling strategy)
3.2. Find the vertex in the search tree closest to that random point, q_n
3.3. Try to add an edge (path) in the tree between q_n and q_r, if you can link them without a collision occurring.
Although that description is adequate, after a while working in this space, I really do prefer the pseudocode of figure 5.16 on RRT/RDT in Steven LaValle's book "Planning Algorithms".
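The numbered steps can be sketched in a few lines of R. This is a minimal illustration under strong assumptions: a 2-D unit square, no obstacles (so the collision check is only marked by a comment), and a fixed maximum step toward q_r, which is a common variant rather than a full edge to q_r. The function name `rrt` and all parameter values are hypothetical.

```r
# Minimal 2-D RRT sketch: unit square, no obstacles, fixed step length.
rrt <- function(start, goal, step = 0.05, goal_tol = 0.05, max_iter = 5000) {
  nodes <- matrix(start, nrow = 1)   # vertex coordinates, one row per vertex
  parent <- 0L                       # parent[i] = row of the parent of vertex i
  reached <- FALSE
  for (iter in seq_len(max_iter)) {
    q_r <- runif(2)                                  # 3.1 sample a location
    d <- sqrt((nodes[, 1] - q_r[1])^2 + (nodes[, 2] - q_r[2])^2)
    near <- which.min(d)                             # 3.2 nearest vertex q_n
    q_n <- nodes[near, ]
    # 3.3 move from q_n toward q_r by at most `step`
    # (a collision check on this edge would go here)
    q_new <- q_n + (q_r - q_n) * min(1, step / d[near])
    nodes <- rbind(nodes, q_new, deparse.level = 0)
    parent <- c(parent, near)
    if (sqrt(sum((q_new - goal)^2)) < goal_tol) {
      reached <- TRUE
      break
    }
  }
  list(nodes = nodes, parent = parent, reached = reached)
}

set.seed(1)
tree <- rrt(start = c(0.1, 0.1), goal = c(0.9, 0.9))
nrow(tree$nodes)   # number of vertices grown before stopping
```

Following `parent` back from the last vertex to the first gives the path, and plotting `tree$nodes` shows how the tree spreads over the square.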
Tree Structure
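The reason that the tree ends up covering the entire search space (in most cases) is the combination of the sampling strategy and always looking to connect from the nearest point in the tree. This effect is described as the Voronoi bias: vertices on the edge of the tree have the largest Voronoi regions, so a random sample is most likely to land nearest to one of them, and the tree is pulled outward into unexplored space.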
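A small experiment can make the difference from a naive random tree visible. In this sketch (all names and parameters are assumptions, and there are no obstacles) both trees start at the center of the unit square and add the same number of vertices; the only difference is how the vertex to extend from is chosen.

```r
grow_tree <- function(n_iter, step = 0.02, use_rrt = TRUE) {
  nodes <- matrix(c(0.5, 0.5), nrow = 1)   # start at the center of the unit square
  for (i in seq_len(n_iter)) {
    if (use_rrt) {
      target <- runif(2)                             # random sample in the square
      d <- sqrt(rowSums(sweep(nodes, 2, target)^2))
      from <- nodes[which.min(d), ]                  # nearest existing vertex
    } else {
      from <- nodes[sample.int(nrow(nodes), 1), ]    # random existing vertex
      angle <- runif(1, 0, 2 * pi)
      target <- from + c(cos(angle), sin(angle))     # random direction
    }
    len <- sqrt(sum((target - from)^2))
    new <- from + (target - from) * min(1, step / len)
    nodes <- rbind(nodes, new, deparse.level = 0)
  }
  nodes
}

# Largest distance any vertex has travelled from the start
reach <- function(nodes) max(sqrt(rowSums(sweep(nodes, 2, c(0.5, 0.5))^2)))

set.seed(3)
reach_cmp <- c(rrt = reach(grow_tree(500, use_rrt = TRUE)),
               naive = reach(grow_tree(500, use_rrt = FALSE)))
reach_cmp
```

The naive tree keeps extending vertices that are mostly near the start, so it stays clumped, while the RRT's nearest-to-sample rule keeps selecting frontier vertices.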
Sampling Strategy
The choice of where to place the next vertex that you will attempt to connect to is the sampling problem. In simple cases, where search is low dimensional, uniform random placement (or uniform random placement biased toward the goal) works adequately. In high dimensional problems, or when motions are very complex (when joints have positions, velocities and accelerations), or configuration is difficult to control, sampling strategies for RRTs are still an open research area.
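For the simple low-dimensional case, uniform placement with a bias toward the goal could look like the sketch below. `sample_config` and the `goal_bias` probability are hypothetical; the value 0.05 is only a placeholder for a tuning parameter.

```r
# Sketch of a goal-biased uniform sampler over an axis-aligned box.
sample_config <- function(lower, upper, goal, goal_bias = 0.05) {
  if (runif(1) < goal_bias) return(goal)          # occasionally aim at the goal
  runif(length(lower), min = lower, max = upper)  # otherwise uniform in the box
}

set.seed(2)
sample_config(lower = c(0, 0), upper = c(1, 1), goal = c(0.9, 0.9))
```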
Libraries
The MSL library is a good starting point if you're really stuck on implementation, but it hasn't been actively maintained since 2003. A more up-to-date library is the Open Motion Planning Library (OMPL). You'll also need a good collision detection library.
Planning Terminology & Advice
From a terminology point of view, the hard bit is to realise that although lots of the diagrams you see in the (early years of) publications on RRT are in two dimensions (trees that link 2d points), this is the absolute simplest case.
Typically, a mathematically rigorous way to describe complex physical situations is required. A good example of this is planning for a robot arm with n linkages. Describing the end of such an arm requires a minimum of n joint angles. This minimal set of parameters needed to describe a position is a configuration (or, in some publications, a state). A single configuration is often denoted q.
The combination of all possible configurations (or a subset thereof) that can be achieved makes up a configuration space (or state space). This can be as simple as an unbounded 2d plane for a point in the plane, or incredibly complex combinations of ranges of other parameters.
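The robot arm example can be made concrete with a short sketch. It assumes a planar arm whose joint angles are measured relative to the previous link; `end_effector` and the link lengths are hypothetical.

```r
# Sketch: a planar arm with n links. The configuration q is the vector of
# n joint angles (radians, each relative to the previous link).
end_effector <- function(q, link_lengths) {
  heading <- cumsum(q)                      # absolute orientation of each link
  c(x = sum(link_lengths * cos(heading)),
    y = sum(link_lengths * sin(heading)))
}

q <- c(pi / 2, -pi / 2, 0)                  # one configuration of a 3-link arm
end_effector(q, link_lengths = c(1, 1, 1))  # x = 2, y = 1 (up to rounding)
```

A planner for this arm samples and connects points q in the n-dimensional space of joint angles, not points in the 2-D workspace where the arm is drawn.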

R tree and Graph partitioning library in R

I need to do efficient searching of d-dimensional points and also make efficient k-NN queries for a point in d dimensions. Therefore I require an R-Tree library: one which will build the R-Tree structure, which I can then query whenever needed.
Also, I need some library like METIS or hMETIS, although my application does not involve hypergraphs. My requirement is to find the min cut set of a graph which divides the graph into two roughly equal-sized graphs.
The thing is, I would require libraries which support these in R.
I have found a library, RANN, which has kd-tree based k-NN queries, but the problem is that either I have to make all the k-NN queries at once and store the results in a huge array, or I need to call the function (nn or nn2) every time I need it, which defeats the O(n lg n) retrieval growth of time.
Can anyone tell me if there are any such libraries in R?
Note: I would require the R-Tree library for implementing clustering algorithms efficiently, and the graph partition library would be required to implement the CHAMELEON clustering algorithm.
After some study of R and its libraries, I think it is better to get the required libraries, or write my own code, in C or C++ and then use it through the .C() or .Call() R-to-C language interface.
Also i need to have some library like that of METIS or hMETIS, although my application does not involve hypergraphs. My requirement is to find the min cut set of a graph which divides the graph in roughly two equal sized graphs.
Although this is an old question, I have written something like this recently. That is,
1. A Kernighan-Lin like algorithm.
2. An algorithm to find an approximately connected balanced partition using the method suggested by Chlebíková (1996).
3. An algorithm that takes the solution found by the method in 2. and tries to minimize the cut price using a Kernighan-Lin like algorithm while still requiring that the two sets in the partition are connected.
From the graphs I am working with, 3. seems to often find a quite good solution for bigger graphs (say ~ 1-4 million edges with ~ 1 million vertices). This takes seconds or a few minutes. The implementation is in the pedmod package at https://github.com/boennecd/pedmod. Call the following to install the package and to find a vignette with further details:
remotes::install_github("boennecd/pedmod", build_vignettes = TRUE)
vignette("pedigree_partitioning", package = "pedmod")
I am not sure how my implementation compares in terms of speed and quality of the partition compared with other software though.
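To give a feel for what a Kernighan-Lin style refinement does, here is a toy sketch in base R. It is not the pedmod implementation and not the full Kernighan-Lin procedure: it is a simplified greedy variant that keeps swapping the single best pair of vertices while the cut weight strictly decreases, and it ignores the connectivity requirement. `cut_weight` and `kl_refine` are hypothetical names.

```r
# Total weight of the edges crossing the partition (side is a logical vector).
cut_weight <- function(adj, side) sum(adj[side, !side])

kl_refine <- function(adj, side) {
  repeat {
    best_gain <- 0
    best_pair <- NULL
    for (a in which(side)) {
      for (b in which(!side)) {
        trial <- side
        trial[a] <- FALSE                    # swap a and b across the cut
        trial[b] <- TRUE
        gain <- cut_weight(adj, side) - cut_weight(adj, trial)
        if (gain > best_gain) {
          best_gain <- gain
          best_pair <- c(a, b)
        }
      }
    }
    if (is.null(best_pair)) break            # no swap improves the cut
    side[best_pair[1]] <- FALSE
    side[best_pair[2]] <- TRUE
  }
  side
}

# Two triangles {1,2,3} and {4,5,6} joined by the single edge 3-4
adj <- matrix(0, 6, 6)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1                        # make the matrix symmetric

start <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)   # poor initial split
final <- kl_refine(adj, start)
cut_weight(adj, start)   # 5
cut_weight(adj, final)   # 1
```

Swapping pairs keeps the two sides the same size, which is how this family of methods preserves balance while reducing the cut.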
References
Chlebíková, Janka. 1996. “Approximating the Maximally Balanced Connected Partition Problem in Graphs.” Information Processing Letters 60 (5): 225–30.
Kernighan, B. W., and S. Lin. 1970. “An Efficient Heuristic Procedure for Partitioning Graphs.” The Bell System Technical Journal 49 (2): 291–307.

Resources