What is the most efficient way to store a set of points (embeddings) such that queries for closest points are computed quickly - information-retrieval

Given a set of embeddings, i.e. set of [name, vector representation]
how should I store it such that queries on the closest points are computed quickly. For example given 100 embeddings in 2-d space, if I query the data struct on the 5 closest points to (10,12), it returns { [a,(9,11.5)] , [b,(12,14)],...}
The trivial approach is calculate all distances, sort and return top-k points. Alternatively, one might think of storing in a 2-d array in blocks/units of mXn space to cover the range of the embedding space. I don't think this is extensible to higher dimensions, but I'm willing to be corrected.

There are standard approximate nearest neighbors library such as faiss, flann, java-lsh etc. (which are either LSH or Product Quantization based), which you may use.
The quickest solution (which I found useful) is to transform a vector of (say 100 dimensions) to a long variable (64 bits) by using the Johnson–Lindenstrauss transform. You can then use Hamming similarity (i.e. 64 minus the number of bits set in a XOR b) to compute the similarity between bit vectors a and b. You could use the POPCOUNT machine instruction to this effect (which is very fast).
In effect, if you use POPCOUNT in C, even if you do a complete iteration over the whole set of binary transformed vectors (long variables of 64 bits), it still will be very fast.

Related

cosine similarity LSH and random hyperplane

I read few solutions about nearest neighbor search in high-dimensions using random hyperplane, but I am still confused about how the buckets work. I have 100 millions of document in the form of 100-dimension vectors and 1 million queries. For each query, I need to find the nearest neighbor based on cosine similarity. The brute force approach is to find cosine value of query with all 100 million documents and select the the ones with value close to 1. I am struggling with the concept of random hyperplanes where I can put the documents in buckets so that I don't have to calculate cosine value 100 million times for each query.
Think in a geometric way. Imagine your data like points in a high dimensional space.
Create random hyperplanes (just planes in a higher dimension), do the reduction using your imagination.
These hyperplanes cut your data (the points), creating partitions, where some points are being positioned apart from others (every point in its partition; would be a rough approximation).
Now the buckets should be populated according to the partitions formed by the hyperplanes. As a result, every bucket contains much less points than the total size of the pointset (because every partition I talked about before, contains less points than the total size of your pointset).
As a consequence, when you pose a query, you check much less points (with the assistance of the buckets) than the total size. That's all the gain here, since checking less points, means that you do much better (faster) than the brute force approach, which checks all the points.

How to perform mathematical operations on large numbers

I have a question about working on very big numbers. I'm trying to run RSA algorithm and lets's pretend i have 512 bit number d and 1024 bit number n. decrypted_word = crypted_word^d mod n, isn't it? But those d and n are very large numbers! Non of standard variable types can handle my 512 bit numbers. Everywhere is written, that rsa needs 512 bit prime number at last, but how actually can i perform any mathematical operations on such a number?
And one more think. I can't use extra libraries. I generate my prime numbers with java, using BigInteger, but on my system, i have only basic variable types and STRING256 is the biggest.
Suppose your maximal integer size is 64 bit. Strings are not that useful for doing math in most languages, so disregard string types. Now choose an integer of half that size, i.e. 32 bit. An array of these can be interpreted as digits of a number in base 232. With these, you can do long addition and multiplication, just like you are used to with base 10 and pen and paper. In each elementary step, you combine two 32-bit quantities, to produce both a 32-bit result and possibly some carry. If you do the elementary operation in 64-bit arithmetic, you'll have both of these as part of a single 64-bit variable, which you'll then have to split into the 32-bit result digit (via bit mask or simple truncating cast) and the remaining carry (via bit shift).
Division is harder. But if the divisor is known, then you may get away with doing a division by constant using multiplication instead. Consider an example: division by 7. The inverse of 7 is 1/7=0.142857…. So you can multiply by that to obtain the same result. Obviously we don't want to do any floating point math here. But you can also simply multiply by 14286 then omit the last six digits of the result. This will be exactly the right result if your dividend is small enough. How small? Well, you compute x/7 as x*14286/100000, so the error will be x*(14286/100000 - 1/7)=x/350000 so you are on the safe side as long as x<350000. As long as the modulus in your RSA setup is known, i.e. as long as the key pair remains the same, you can use this approach to do integer division, and can also use that to compute the remainder. Remember to use base 232 instead of base 10, though, and check how many digits you need for the inverse constant.
There is an alternative you might want to consider, to do modulo reduction more easily, perhaps even if n is variable. Instead of expressing your remainders as numbers 0 through n-1, you could also use 21024-n through 21024-1. So if your initial number is smaller than 21024-n, you add n to convert to this new encoding. The benefit of this is that you can do the reduction step without performing any division at all. 21024 is equivalent to 21024-n in this setup, so an elementary modulo reduction would start by splitting some number into its lower 1024 bits and its higher rest. The higher rest will be right-shifted by 1024 bits (which is just a change in your array indexing), then multiplied by 21024-n and finally added to the lower part. You'll have to do this until you can be sure that the result has no more than 1024 bits. How often that is depends on n, so for fixed n you can precompute that (and for large n I'd expect it to be two reduction steps after addition but hree steps after multiplication, but please double-check that) whereas for variable n you'll have to check at runtime. At the very end, you can go back to the usual representation: if the result is not smaller than n, subtract n. All of this should work as described if n>2512. If not, i.e. if the top bit of your modulus is zero, then you might have to make further adjustments. Haven't thought this through, since I only used this approach for fixed moduli close to a power of two so far.
Now for that exponentiation. I very much suggest you do the binary approach for that. When computing xd, you start with x, x2=x*x, x4=x2*x2, x8=…, i.e. you compute all power-of-two exponents. You also maintain some intermediate result, which you initialize to one. In every step, if the corresponding bit is set in the exponent d, then you multiply the corresponding power into that intermediate result. So let's say you have d=11. Then you'd compute 1*x1*x2*x8 because d=11=1+2+8=10112. That way, you'll need only about 1024 multiplications max if your exponent has 512 bits. Half of them for the powers-of-two exponentiation, the other to combine the right powers of two. Every single multiplication in all of this should be immediately followed by a modulo reduction, to keep memory requirements low.
Note that the speed of the above exponentiation process will, in this simple form, depend on how many bits in d are actually set. So this might open up a side channel attack which might give an attacker access to information about d. But if you are worried about side channel attacks, then you really should have an expert develop your implementation, because I guess there might be more of those that I didn't think about.
You may write some macros you may execute under Microsoft for functions like +, -, x, /, modulo, x power y which work generally for any integer of less than ten or hundred thousand digits (the practical --not theoretical-- limit being the internal memory of your CPU). Please note the logic is exactly the same as the one you got at elementary school.
E.g.: p= 1819181918953471 divider of (2^8091) - 1, q = ((2^8091) - 1)/p, mod(2^8043 ; q ) = 23322504995859448929764248735216052746508873363163717902048355336760940697615990871589728765508813434665732804031928045448582775940475126837880519641309018668592622533434745187004918392715442874493425444385093718605461240482371261514886704075186619878194235490396202667733422641436251739877125473437191453772352527250063213916768204844936898278633350886662141141963562157184401647467451404036455043333801666890925659608198009284637923691723589801130623143981948238440635691182121543342187092677259674911744400973454032209502359935457437167937310250876002326101738107930637025183950650821770087660200075266862075383130669519130999029920527656234911392421991471757068187747362854148720728923205534341236146499449910896530359729077300366804846439225483086901484209333236595803263313219725469715699546041162923522784170350104589716544529751439438021914727772620391262534105599688603950923321008883179433474898034318285889129115556541479670761040388075352934137326883287245821888999474421001155721566547813970496809555996313854631137490774297564881901877687628176106771918206945434350873509679638109887831932279470631097604018939855788990542627072626049281784152807097659485238838560958316888238137237548590528450890328780080286844038796325101488977988549639523988002825055286469740227842388538751870971691617543141658142313059934326924867846151749777575279310394296562191530602817014549464614253886843832645946866466362950484629554258855714401785472987727841040805816224413657036499959117701249028435191327757276644272944743479296268749828927565559951441945143269656866355210310482235520220580213533425016298993903615753714343456014577479225435915031225863551911605117029393085632947373872635330181718820669836830147312948966028682960518225213960218867207825417830016281036121959384707391718333892849665248512802926601676251199711698978725399048954325887410317060400620412797240129787158839164969382498537742579233544463501470239575760940937130926062252501116458281610468726777710383038372260777522143500312913040987942762244940009811450966646527814576364565964518092955053720983465333258335601691477534154940549197873199633313223848155047098569827560014018412679602636286195283270106917742919383395056306107175539370483171915774381614222806960872813575048014729965930007408532959309197608469115633821869206793759322044599554551057140046156235152048507130125695763956991351137040435703946195318000567664233417843805257728.
The last step took about 0.1 sec.
wpjo (willibrord oomen on academia.edu)

Efficient Calculation of an N-Dimensional Cross Product?

As per the title, is the best way to calculate the n-dimensional cross product just using the determinant definition and using the LU Decomposition method of doing as such or could you guys suggest a better one?
Thanks
Edit: for clarity I mean http://en.wikipedia.org/wiki/Cross_product and not the Cartesian Product
Edit: It also seems that using the Leibniz Formula might help - though I don't know how that compares to LU Decomp. at the moment.
From your comment, it seems like you are looking for an operation which takes n −1 vectors as input and computes a single vector as its result, which will be orthogonal to all the input vectors and perhaps have a well-defined length as well.
With defined length
You can characterize the 3-dimensional cross product v =a ×b using the identity v ∙w =det(a,b,w). In other words, taking the cross product of the input vectors and then computing the dot product with any other vector w is the same as plugging the input vectors and that other vector into a matrix and computing its determinant.
This definition can be generalized to arbitrary dimensions. Due to the way a determinant can be computed using Laplace expansion along the last column, the resulting coordinates of that cross product will be the values of all (n −1)×(n −1) sub-determinants you can form from the input vectors, with alternating signs. So yes, Leibniz might be useful in theory, although it is hardly suitable for real-world computations. In practice, you'll soon have to figure out ways to avoid repeating computationswhile computing these n determinants. But wait for the last section of this answer…
Just the direction
Most applications however can do with a weaker requirement. They don't care about the length of the resulting vector, but only about its direction. In that case, what you are asking for is the kernel of the (n −1)×n matrix you can form by taking the input vectors as rows. Any element of that kernel will be orthogonal to the input vectors, and since computing kernels is a common task, you can build on a lot of existing implementations, e.g. Lapack. Details might depend on the language you are using.
Combining these
You can even combine the two approaches above: compute one element of the kernel, and for a non-zero entry of that vector, also compute the corresponding (n −1)×(n −1) determinant which would give you that single coordinate using the first approach. You can then simply scale the vector so that the selected coordinate reaches the computed value, and all the other coordinates will match that one.

Proper similarity measure for clustering

I have problems in finding a proper similarity measure for clustering. I have around 3000 arrays of sets, where each set contains features of certain domain (e.g., number, color, days, alphabets, etc). I'll explain my problem with an example.
Lets assume i have only 2 arrays(a1 & a2) and I want to find the similarity between them. each array contains 4 sets (in my actual problem there are 250 sets (domains) per array) and a set can be empty.
a1: {a,b}, {1,4,6}, {mon, tue, wed}, {red, blue,green}
a2: {b,c}, {2,4,6}, {}, {blue, black}
I have come with a similarity measure using Jaccard index (denoted as J):
sim(a1,a2) = [J(a1[0], a2[0]) + J(a1[1], a2[1]) + ... + J(a1[3], a2[3])]/4
note:I divide by total number of sets (in the above example 4) to keep the similarity between 0 and 1.
Is this a proper similarity measure and are there any flaws in this approach. I am applying Jaccard index for each set separately because I want compare the similarity between related domains(i.e. color with color, etc...)
I am not aware of any other proper similarity measure for my problem.
Further, can I use this similarity measure for clustering purpose?
This should work for most clustering algorithms. Don't use k-means - it can handle numeric vector spaces only. But you have a vector-of-sets type of data.
You may want to use a different mean than the arithmetic average for combining the four Jaccard measures. Try the harmonic or geometric means. See, the average over 250 values will likely be somewhere close to 0.5 all the time, so you need a mean that is more "aggressive".
So the plan sounds good. Just try it, implement this similarity and plug it into various clustering algorithm and see if they find something. I like OPTICS for exploring data and distance functions, as the OPTICS plot can be very indicative whether (or not!) there is something to be found based on the distance function. If the plot is too flat, there just is not much to be found, it is like a representative sample of the distances in the data set...
I use ELKI, and they even have a tutorial on adding custom distance functions: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/DistanceFunctions although you can probably just compute the distances with whatever tool you like and write them to a similarity matrix. At 3000 objects this will remain very manageable, 4200000 doubles is just a few MB.

Find number range intersection

What is the best way to find out whether two number ranges intersect?
My number range is 3023-7430, now I want to test which of the following number ranges intersect with it: <3000, 3000-6000, 6000-8000, 8000-10000, >10000. The answer should be 3000-6000 and 6000-8000.
What's the nice, efficient mathematical way to do this in any programming language?
Just a pseudo code guess:
Set<Range> determineIntersectedRanges(Range range, Set<Range> setofRangesToTest)
{
Set<Range> results;
foreach (rangeToTest in setofRangesToTest)
do
if (rangeToTest.end <range.start) continue; // skip this one, its below our range
if (rangeToTest.start >range.end) continue; // skip this one, its above our range
results.add(rangeToTest);
done
return results;
}
I would make a Range class and give it a method boolean intersects(Range) . Then you can do a
foreach(Range r : rangeset) { if (range.intersects(r)) res.add(r) }
or, if you use some Java 8 style functional programming for clarity:
rangeset.stream().filter(range::intersects).collect(Collectors.toSet())
The intersection itself is something like
this.start <= other.end && this.end >= other.start
This heavily depends on your ranges. A range can be big or small, and clustered or not clustered. If you have large, clustered ranges (think of "all positive 32-bit integers that can be divided by 2), the simple approach with Range(lower, upper) will not succeed.
I guess I can say the following:
if you have little ranges (clustering or not clustering does not matter here), consider bitvectors. These little critters are blazing fast with respect to union, intersection and membership testing, even though iteration over all elements might take a while, depending on the size. Furthermore, because they just use a single bit for each element, they are pretty small, unless you throw huge ranges at them.
if you have fewer, larger ranges, then a class Range as describe by otherswill suffices. This class has the attributes lower and upper and intersection(a,b) is basically b.upper < a.lower or a.upper > b.lower. Union and intersection can be implemented in constant time for single ranges and for compisite ranges, the time grows with the number of sub-ranges (thus you do not want not too many little ranges)
If you have a huge space where your numbers can be, and the ranges are distributed in a nasty fasion, you should take a look at binary decision diagrams (BDDs). These nifty diagrams have two terminal nodes, True and False and decision nodes for each bit of the input. A decision node has a bit it looks at and two following graph nodes -- one for "bit is one" and one for "bit is zero". Given these conditions, you can encode large ranges in tiny space. All positive integers for arbitrarily large numbers can be encoded in 3 nodes in the graph -- basically a single decision node for the least significant bit which goes to False on 1 and to True on 0.
Intersection and Union are pretty elegant recursive algorithms, for example, the intersection basically takes two corresponding nodes in each BDD, traverse the 1-edge until some result pops up and checks: if one of the results is the False-Terminal, create a 1-branch to the False-terminal in the result BDD. If both are the True-Terminal, create a 1-branch to the True-terminal in the result BDD. If it is something else, create a 1-branch to this something-else in the result BDD. After that, some minimization kicks in (if the 0- and the 1-branch of a node go to the same following BDD / terminal, remove it and pull the incoming transitions to the target) and you are golden. We even went further than that, we worked on simulating addition of sets of integers on BDDs in order to enhance value prediction in order to optimize conditions.
These considerations imply that your operations are bounded by the amount of bits in your number range, that is, by log_2(MAX_NUMBER). Just think of it, you can intersect arbitrary sets of 64-bit-integers in almost constant time.
More information can be for example in the Wikipedia and the referenced papers.
Further, if false positives are bearable and you need an existence check only, you can look at Bloom filters. Bloom filters use a vector of hashes in order to check if an element is contained in the represented set. Intersection and Union is constant time. The major problem here is that you get an increasing false-positive rate if you fill up the bloom-filter too much.
Information, again, in the Wikipedia, for example.
Hach, set representation is a fun field. :)
In python
class nrange(object):
def __init__(self, lower = None, upper = None):
self.lower = lower
self.upper = upper
def intersection(self, aRange):
if self.upper < aRange.lower or aRange.upper < self.lower:
return None
else:
return nrange(max(self.lower,aRange.lower), \
min(self.upper,aRange.upper))
If you're using Java
Commons Lang Range
has a
overlapsRange(Range range) method.

Resources