Handle embedding vectors similarities - vector

I need to compare one vector embedding against 50k (can grow to 100-200k) vector embeddings and find top 10 similarities.
I need to do it fast for online user interactions.
I have access to azure cloud.
How should I handle such scenario?
Where do I store all embeddings so my code can access and process them quickly?
Should I use python library and handle all in memory? Should I distribute the calculation somehow for speed and memory better usage?
Is there some graph db that handles this case and can compare vectors using KNN/ANN?
Whats the best approach for this usecase?

Related

How does OpenMPI's gather work?

I'm new to MPI and I'm trying to understand how MPI (and specifically OpenMPI) work in order to reason about the performance of my system.
I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.
Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle incoming data sequentially or concurrently? In other words, if processes 1 is the first to make a connection with processes 0, will process 2 have to wait until processes 1 is done sending its data before it can start to send its data?
Thanks!
There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:
linear with synchronisation - used when messages are large to mid-size
binomial - used when the number of processes is large or the message size is small
basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.
Hristo's answer is of course excellent, but I would like to offer a different point of view.
Contrary to your expectation, the question is not simple. It isn't even possible to specifically answer it without knowing more system specifics, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.
First, consider the complexity of a the gather operation: The total network transfer to the root as well as the memory requirements are linearly growing with the number of processes in the communicator. This naturally limits scalability.
Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.
Now when you have your program, you should measure and see where time is spent - or wasted. For that you should an MPI performance analysis tools. Now if you have identified that your Gather has a significant impact on performance, you can go ahead and try to optimize that: But to do so, first consider if you can structure your communication conceptually better, e.g. by somehow removing the computation all together or using a clever reduction instead. If you still need to stick to the gather: go ahead and tune your MPI implementation. Afterwards verify that your optimization did indeed improve performance on your specific system.

Passing multiple variables in MPI

I am trying an implementation in MPI where I am invoking multiple slaves (upto 4) on the same machine (localhost) and distributing the computations of my for loop amongst the slaves. MPI is suited for my current application and I cannot take the openMP route.
The variables that are involved are about 50 and all are uni-dimensional arrays.
What would be the best way to send the 50 variables to the master process? Should I send and receive all variables or should I pack them in one 2D array and send this array across to the master?
I am looking for an efficient and computationally inexpensive approach.
Thanks
As so often: it depends. If your individual arrays are sufficiently large, such that latency gets insignificant, it would be fine to send each one individually. Otherwise, it will be better to increase the size of your message by collecting all those arrays into a single one.
If your variables are of different type you could make use of MPI datatypes to describe the layout of your data.
Additionally, if you need to collect this data from multiple processes it might be a good idea to use MPI_Gather or one of its variants.
It might also be, that a viable option in your scenario would be to make use of the one-sided communication facilities offered by MPI.

Hadoop suitability for recursive data processing

I have a filtering algorithm that needs to be applied recursively and I am not sure if MapReduce is suitable for this job. W/o giving too much away, I can say that each object that is being filtered is characterized by a collection if ordered list or queue.
The data is not huge, just about 250MB when I export from SQL to
CSV.
The mapping step is simple: the head of the list contains an object that can classify the list as belonging to one of N mapping nodes. the filtration algorithm at each node works on the collection of lists assigned to the node and at the end of the filtration, either a list remains the same as before the filtration or the head of the list is removed.
The reduce function is simple too: all the map jobs' lists are brought together and may have to be written back to disk.
When all the N nodes have returned their output, the mapping step is repeated with this new set of data.
Note: N can be as much as 2000 nodes.
Simple, but it requires perhaps up to a 1000 recursions before the algorithm's termination conditions are met.
My question is would this job be suitable for Hadoop? If not, what are my options?
The main strength of Hadoop is its ability to transparently distribute work on a large number of machines. In order to fully benefit from Hadoop your application has to be characterized, at least by the following three things:
work with large amounts of data (data which is distributed in the cluster of machines) - which would be impossible to store on one machine
be data-parallelizable (i.e. chunks of the original data can be manipulated independently from other chunks)
the problem which the application is trying to solve lends itself nicely to the MapReduce (scatter - gather) model.
It seems that out of these 3, your application has only the last 2 characteristics (with the observation that you are trying to recursively use a scatter - gather procedure - which means a large number of jobs - equal to the recursion depth; see last paragraph why this might not be appropriate for hadoop).
Given the amount of data you're trying to process, I don't see any reason why you wouldn't do it on a single machine, completely in memory. If you think you can benefit from processing that small amount of data in parallel, I would recommend focusing on multicore processing than on distributed data intensive processing. Of course, using the processing power of a networked cluster is tempting but this comes at a cost: mainly the time inefficiency given by the network communication (network being the most contended resource in a hadoop cluster) and by the I/O. In scenarios which are well-fitted to the Hadoop framework these inefficiency can be ignored because of the efficiency gained by distributing the data and the associated work on that data.
As I can see, you need 1000 jobs. The setup and the cleanup of all those jobs would be an unnecessary overhead for your scenario. Also, the overhead of network transfer is not necessary, in my opinion.
Recursive algos are hard in the distributed systems since they can lead to a quick starvation. Any middleware that would work for that needs to support distributed continuations, i.e. the ability to make a "recursive" call without holding the resources (like threads) of the calling side.
GridGain is one product that natively supports distributed continuations.
THe litmus test on distributed continuations: try to develop a naive fibonacci implementation in distributed context using recursive calls. Here's the GridGain's example that implements this using continuations.
Hope it helps.
Q&D, but I suggest you read a comparison of MongoDB and Hadoop:
http://www.osintegrators.com/whitepapers/MongoHadoopWP/index.html
Without knowing more, it's hard to tell. You might want to try both. Post your results if you do!

How to store a huge hash table in RAM and share it between different applications?

The data contains information like billions of ID-scores pairs. To quickly access these paired information, I plan to use the hash-table container since its time complexity of search is O(1). Considering the the raw data is around 80G, I don't want to load the data into RAM every time when I need to run search application. What I want to do is to generate the hash-table once and then store it in RAM with persistence of filesystem lifetime (the expense of RAM is not a criteria), and search it with different applications.
Based on my limited understanding, I could use "Memory Mapped Files" (boost C++ libraries). But I have questions:
1) Is it possible to keep the hash-table data structure when write it to the mapped file?
2) How much time it will cost to map the existed file to RAM?
Any answers/comments/suggestions are most welcomed!
Thanks,
1) Yes. The file is just bytes, just like memory.
2) Creating the mapping will be effectively instantaneous. Node that you won't be able to map all of it contiguously at once except on a 64-bit OS. Of course, if the file cache can't hold the portion of the map you're using, it will have to be read from disk.
How big are IDs? How big are pairs? How much locality of reference do you have? (Are there heavily-used pair and lightly used pairs?) How often will you be searching for pairs that aren't present? Is the data read-mostly? There may be better ways to do it. I'd strongly suggest starting with a broader question to make sure you're not stuck on a sub-optimal path.

How to create a distributed suffix tree on HPC cluster

I want to create a suffix tree for 4 GB of input string. Ideally the size of tree in memory will be approx 100 GB. I can’t do this on a normal desktop. Is there any way to do it on windows HPC cluster? How can I distribute the suffix tree on different compute node of HPC?
Yes, it's possible to do this; Google Scholar lists several papers on the topic. The trick is in the initial assignment of partial suffixes to the initial processors; that has to be chosen so that each processor can go off and find it's section of the tree independantly. Once that's done, the usual suffix-tree operations can be done fairly efficiently. I don't know any public examples of implementations.

Resources