Hashing Questions - math

I've been doing some questions related to hashing and came across these three. I am a bit confused as to how to go about solving them.
The question is stated as follows:
Suppose a hash function H(m) takes any input m and produces a fixed-length 64-bit digest h. Answer the following questions:
a) How many hashes would you need to compute in order to have a 50% probability of finding any two inputs m1 and m2 such that H(m1) = H(m2)?
b) Given some hash digest h, how many inputs would you need to hash in order to have a 50% probability of finding an input m such that h = H(m)?
c) Given some input m1, how many inputs would you need to hash in order to have a 50% probability of finding an input m2 such that H(m1) = H(m2)?
Would we simply be using the birthday paradox to solve these questions or is there another way to solve them?
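Yes: part (a) is the classic birthday problem, while (b) and (c) are preimage-style searches in which each trial independently hits the target with probability 2^-64. For what it's worth, here is a quick numeric sketch (in Python) of the standard approximations:

from math import log, sqrt

N = 2 ** 64  # number of possible digests

# (a) Collision between ANY two inputs (birthday bound):
# P(no collision after k hashes) ~= exp(-k^2 / (2N)); solve for P = 0.5.
k_collision = sqrt(2 * N * log(2))
print(f"(a)    ~2^{log(k_collision, 2):.1f} hashes")  # ~2^32.2

# (b), (c) Hitting one FIXED digest: each trial succeeds with probability 1/N.
# Solve (1 - 1/N)^k = 0.5, i.e. k = ln 2 / -ln(1 - 1/N) ~= N * ln 2.
k_preimage = N * log(2)
print(f"(b, c) ~2^{log(k_preimage, 2):.1f} hashes")   # ~2^63.5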

Related

Calculate output of dynamic fnn without linear algebra

I'm working on a project that involves evolving the FNN topology at runtime, and I want to offload the NN output calculation to my GPU. I'm currently using Unity (soon switching to plain C# libraries or something similar), so I use HLSL compute shaders.
Since I am very restricted by HLSL's syntax (no arrays/matrices with dynamic index ranges; dot-product and matrix functions only work on preexisting types like float2, and those "vectors" have a maximum length of 4, i.e. float4), I am looking for a formula with which I can calculate the output of the FNN within a single calculation (obviously still including a loop). That means my shader has the weights, input layer, bias, and structure of the NN each as separate structured buffers. I need a formula that calculates the output without dot products or matrix-vector multiplications, since those are very hard and tortuous to implement. However, I've been trying to find such a formula for days, hopelessly...
Does anyone have a clue? Sadly there are not many resources on the internet concerning HLSL and this problem.
Thanks a lot!
EDIT:
Okay, I got some hints that my actual math-related question is not quite clear. Since I am kind of an artist myself, I made a little illustration.
Now, there are of course multiple ways to calculate the output O. I want to record the following:
v3 = v1*w1 + v2*w2
v4 = v1*w3 + v2*w4
v5 = v3*w5 + v4*w6 = w5*(v1*w1 + v2*w2) + w6*(v1*w3 + v2*w4)
This is an easy way to calculate the final output O in one step:
O1 = w5*(v1*w1 + v2*w2) + w6*(v1*w3 + v2*w4).
There are other quite cool ways to calculate the output in one step. Looking at the weights as matrices, e.g.
m1 = [[w1, w2], [w3, w4]] and m2 = [[w5, w6]].
This way we can calculate O as
L1 = m1 * I and O = m2 * L1
or in one step
O = m2*(m1*I)
which is the more elegant way. I would prefer it this way, but I can't do it with matrix-vector multiplications or any other linear algebra, since I am restricted by my programming tools, so I have to stay with this shape:
O1 = w5*(v1*w1 + v2*w2) + w6*(v1*w3 + v2*w4).
Now this would be really easy if I had a neural network with fixed topology. However, since it evolves during runtime, I have to find a function/formula with which I can calculate O independent of the topology. All the information I have is ONE list of ALL weights (w = [w1, w2, w3, w4, w5, w6]), a list of inputs (i = [v1, v2]), and the structure of the NN (s = [2, 2, 1] - 2 input nodes, one hidden layer with 2 nodes, and one output node).
However, I can't think of an algorithm that calculates the output from the given information efficiently.
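For what it's worth, here is a minimal sketch in Python of the kind of index arithmetic that ports directly to an HLSL loop over structured buffers. It assumes the flat weight list is grouped layer by layer and, within a layer, per target node (matching w1..w6 above); biases and activation functions are omitted, as in the example:

def feedforward(w, inputs, s):
    # w      -- flat list of all weights, layer by layer
    # inputs -- list of input activations, e.g. [v1, v2]
    # s      -- layer sizes, e.g. [2, 2, 1]
    values = inputs  # activations of the current layer
    offset = 0       # index of the first weight of the current layer
    for layer in range(1, len(s)):
        nxt = []
        for j in range(s[layer]):          # each node of the next layer
            acc = 0.0
            for k in range(s[layer - 1]):  # each node of the current layer
                acc += values[k] * w[offset + j * s[layer - 1] + k]
            nxt.append(acc)
        offset += s[layer] * s[layer - 1]
        values = nxt
    return values

# The example above: O1 = w5*(v1*w1 + v2*w2) + w6*(v1*w3 + v2*w4)
print(feedforward([1, 2, 3, 4, 5, 6], [0.5, -0.5], [2, 2, 1]))  # [-5.5]

Only scalar multiply-adds and a running offset into the flat weight buffer are needed, so no dot products or matrix types are required.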

Why does the textreuse package in R make LSH buckets way larger than the original minhashes?

As far as I understand one of the main functions of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates. textreuse is a peer-reviewed ROpenSci package, so I assume it does its job correctly, but my question persists.
Let's say I use 256 permutations and 64 bands for my minhash and LSH functions respectively -- realistic values that are often used to detect with relative certainty (~98%) similarities as low as 50%.
If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:
object.size(trtd$minhashes)
> 1072 bytes
Now let's create the LSH buckets for this object (64 bands) and assign the result to l. I will have:
object.size(l$buckets)
> 6704 bytes
So, the retained hashes in the LSH buckets are six times larger than the original minhashes. I understand this happens because textreuse uses an MD5 digest to create the bucket hashes.
But isn't this too wasteful / overkill, and can't I improve it? Is it normal that our data reduction technique ends up bloating to this extent? And isn't it more effective to match the documents based on the original hashes (similar to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?
Note that I have reviewed the typical texts such as Mining of Massive Datasets, but this question remains about this particular implementation. Also note that the question is not only out of curiosity, but rather out of need. When you have millions or billions of hashes, these differences become significant.
Package author here. Yes, it would be wasteful to use more hashes/bands than you need. (Though keep in mind we are talking about kilobytes here, which could be much smaller than the original documents.)
The question is, what do you need? If you need to find only matches that are close to identical (i.e., with a Jaccard score close to 1.0), then you don't need a particularly sensitive search. If, however, you need to reliably detect potential matches that share only a partial overlap (i.e., with a Jaccard score closer to 0), then you need more hashes/bands.
Since you've read MMD, you can look up the equation there. But there are two functions in the package, documented here, which can help you calculate how many hashes/bands you need. lsh_threshold() will calculate the threshold Jaccard score that will be detected; while lsh_probability() will tell you how likely it is that a pair of documents with a given Jaccard score will be detected. Play around with those two functions until you get the number of hashes/bands that is optimal for your search problem.
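For reference, here is a quick Python sketch of the standard S-curve formulas from MMD (not the package's own code; the textreuse functions should give comparable numbers). With the question's parameters it reproduces the ~98% detection rate at 50% similarity:

def lsh_probability(similarity, bands, rows):
    # Chance that a pair with the given Jaccard similarity shares
    # at least one bucket, with `bands` bands of `rows` minhashes each.
    return 1 - (1 - similarity ** rows) ** bands

def lsh_threshold(bands, rows):
    # Approximate Jaccard score at which detection becomes likely.
    return (1 / bands) ** (1 / rows)

# 256 minhashes split into 64 bands of 4 rows, as in the question:
print(lsh_probability(0.5, 64, 4))  # ~0.98
print(lsh_threshold(64, 4))         # ~0.35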

Simple function to generate random number sequence without knowing previous number but know current index (no variable assignment)?

Is there any (simple) random generation function that can work without variable assignment? Most generators I've seen look like current = next(current). However, I currently have a restriction (from SQLite) that I cannot use any variables at all.
Is there a way to generate a number sequence (for example, from 1 to max) with only n (current number index in the sequence) and seed?
Currently I am using this:
cast(((1103515245 * Seed * ROWID + 12345) % 2147483648) / 2147483648.0 * Max as int) + 1
with Max being 47 and ROWID being n. However, for some seeds the repeat rate is too high (3 unique values out of 47).
In my requirements, repetition is ok as long as it's not too much (<50%). Is there any better function that meets my need?
The question has sqlite tag but any language/pseudo-code is ok.
P.S.: I have tried using linear congruential generators with some a/c/m triplets and Seed * ROWID as the seed, but it does not work well; it's even worse.
EDIT: I currently use this one, but I do not know where it's from. The rate looks better than mine:
((((Seed * ROWID) % 79) * 53) % "Max") + 1
I am not sure if you still have the same problem but I might have a solution for you.
What you could do is use pseudo-random m-sequence generators based on shift registers, where you just take a high enough order for your primitive polynomial and you don't really need to store any variables.
For more info you can check the wiki page
What you would need to code is just the primitive-polynomial shifting equation, and I have checked in an online editor that it should be very easy to do. I think the easiest way for you would be to work in binary and use PRBS sequences; depending on how many elements you will have, you can choose your sequence length. For example, the code below implements a length of 2^15 = 32768 (PRBS15); I took the primitive polynomial from the wiki page (there you can find primitive polynomials all the way up to PRBS31, which would be 2^31 = 2.1475e+09).
Basically what you need to do is (note that SQLite has no XOR operator, so <> on the masked tap bits stands in for it):
SELECT ((ROWID << 1) | (((ROWID >> 14) & 1) <> ((ROWID >> 13) & 1))) & 0x7fff
The beauty of this approach is that if you take a PRBS whose period is longer than your largest ROWID value, you will have a unique random index. Very simple. :)
If you need help searching for primitive polynomials, you can see my github repo, which deals exactly with finding primitive polynomials and unique m-sequences. It is currently written in Matlab, but I plan to write it in Python in the next few days.
Cheers!
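A quick sanity check of the uniqueness claim (a Python sketch of one PRBS15 step; here ^ is a true bitwise XOR, which is why the SQL above emulates it with a masked <> comparison):

def prbs15_step(state):
    # Feedback bit = bit 14 XOR bit 13 (primitive polynomial x^15 + x^14 + 1).
    fb = ((state >> 14) ^ (state >> 13)) & 1
    return ((state << 1) | fb) & 0x7FFF

outputs = {prbs15_step(i) for i in range(1, 2 ** 15)}
print(len(outputs))  # 32767 -> every nonzero index maps to a distinct value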
What about using a good hash function and mapping the result into the [1...max] range?
Along these lines (in pseudocode); sha1 was added to SQLite 3.17:
sha1(ROWID) % Max + 1
Or use any external C code for hash (murmur, chacha, ...) as shown here
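The same idea in Python, for reference (the function name is mine; mixing the seed in is optional):

import hashlib

def hashed_index(seed, rowid, max_value):
    # Map (seed, rowid) to a pseudo-random value in 1..max_value,
    # the same idea as sha1(ROWID) % Max + 1.
    digest = hashlib.sha1(f"{seed}:{rowid}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % max_value + 1

print([hashed_index(42, n, 47) for n in range(1, 11)])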
A linear congruential generator with appropriately-chosen parameters (a, c, and modulus m) will be a full-period generator, such that it cycles pseudorandomly through every integer in its period before repeating. Although you may have tried this idea before, have you considered that m is equivalent to max in your case? For a list of parameter choices for such generators, see L'Ecuyer, P., "Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure", Mathematics of Computation 68(225), January 1999.
Note that there are some practical issues to implementing this in SQLite, especially if your SQLite version supports only 32-bit integers and 64-bit floating-point numbers (with 52 bits of precision). Namely, there may be a risk of—
overflow if an intermediate multiplication exceeds 32 bits for integers, and
precision loss if an intermediate multiplication results in a greater-than-52-bit number.
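To illustrate the full-period property itself (setting the SQLite-specific issues aside for a moment), here is a minimal Python sketch with hypothetical parameters chosen to satisfy the Hull-Dobell conditions; see L'Ecuyer's tables for better choices:

def lcg_cycle(m, a, c, x0=0):
    # Yield the cycle of x -> (a*x + c) % m starting from x0.
    x = x0
    for _ in range(m):
        yield x
        x = (a * x + c) % m

# m = 2**16, a = 5, c = 1: c is coprime to m, a - 1 is divisible by
# every prime factor of m (just 2) and by 4, so the period is exactly m.
m, a, c = 2 ** 16, 5, 1
print(len(set(lcg_cycle(m, a, c))) == m)  # True: every value visited once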
Also, consider why you are creating the random number sequence:
Is the sequence intended to be unpredictable? In that case, a linear congruential generator alone is not enough, and you should generate unique identifiers by other means, such as by combining unique numbers with cryptographically random numbers.
Will the numbers generated this way be exposed in any way to end users? If not, there is no need to obfuscate them by "shuffling" them.
Also, depending on the SQLite API you're using (for your programming language), there may be a way to write a custom function to convert the seed and ROWID to a random unique number. The details, however, depend heavily on the specific SQLite API. Another answer shows an example for Perl.

Probability of Collisions in Hash Table

When inserting n items into a hash table of size m, assuming that the destination of each item is independently uniformly random, what is the probability that no collision occurs?
My working thus far:
We have n items and m locations.
Each item has a 1/m chance of being in any particular location.
There are nC2 possible pairs of items.
The probability of there being no collisions is the probability that for every location, every pair of items does not hash to that location.
For any given location, for any given pair, the probability that the two items do not hash to that location is (m-1)/m.
Then, for any given location, the probability that the above is true for ALL pairs is ((m-1)/m)^(nC2).
Then, the probability that this is true for every location is
[((m-1)/m)^(nC2)]^(m).
You made a few mistakes in that reasoning. The main one is that you assume that the probabilities for pairs not hashing together are independent, so you can multiply them together. You have not shown that is the case, and in fact it is not the case. Consider three elements a, b, and c. If you know that both a and b do not collide with c, then they are limited to m-1 places rather than the initial m places, and they are more likely to collide with each other than if you just ignore c.
Here is a straightforward way to find your desired probability. Looking at the total possibilities ignoring collisions, each of the n items has m places to go. Those placements are independent, so the total possibilities are m^n (or m**n in Python) if we take order into account.
If we know there are no collisions, those n items are a way of choosing n of the m locations without replacement. So if we take order into account, that makes mPn possibilities -- the ways to choose n items out of m choices without replacement and with order (permutations). Therefore your desired probability is
mPn / m^n = (m!) / ((m-n)! * m^n) = m/m * (m-1)/m * (m-2)/m * ... * (m-n+1)/m
There are n factors in that last expression. (This would be so much better in MathJax!) You can choose which of those three equivalent expressions is best for your purpose.
There are other ways to come up with those expressions, of course. That last one can be thought of as the probability of no collision placing 1 item in m slots times the conditional probability of placing a second item given no prior collision times the conditional probability of placing a third item given no prior collision times ....
Those expressions are fairly easy to test. Just choose specific, small values of m and n, generate all possible choices of n items out of those m, and find the empirical probability of no collisions. This should agree with the formula(s) above. I'll leave the choice of programming language and the coding to you. After all, this is a programming site. I just did this in Python, for multiple choices of n and m, and it works out.
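For reference, a sketch of that empirical check along the lines described (the function name is mine):

from itertools import product
from math import perm

def empirical_no_collision(m, n):
    # Enumerate all m**n placements of n items into m slots and count
    # the placements in which every item lands in a distinct slot.
    total = hits = 0
    for placement in product(range(m), repeat=n):
        total += 1
        hits += len(set(placement)) == n
    return hits / total

m, n = 5, 3
print(empirical_no_collision(m, n))  # 0.48
print(perm(m, n) / m ** n)           # 0.48 -- matches the formula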

How to check if m n-sized vectors are linearly independent?

Disclaimer
This is not strictly a programming question, but most programmers sooner or later have to deal with math (especially algebra), so I think the answer could turn out to be useful to someone else in the future.
Now the problem
I'm trying to check if m vectors of dimension n are linearly independent. If m == n you can just build a matrix using the vectors and check if the determinant is != 0. But what if m < n?
Any hints?
See also this video lecture.
Construct a matrix of the vectors (one row per vector), and perform a Gaussian elimination on this matrix. If any of the matrix rows cancels out, they are not linearly independent.
The trivial case is when m > n, in this case, they cannot be linearly independent.
Construct a matrix M whose rows are the vectors and determine the rank of M. If the rank of M is less than m (the number of vectors), then there is a linear dependence. In the algorithm to determine the rank of M you can stop the procedure as soon as you obtain one row of zeros, but running the algorithm to completion has the added bonus of providing the dimension of the spanning set of the vectors. Oh, and the algorithm to determine the rank of M is merely Gaussian elimination.
Take care for numerical instability. See the warning at the beginning of chapter two in Numerical Recipes.
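In practice, a minimal sketch of this rank check using NumPy (which computes the rank via SVD with a sensible numerical tolerance, addressing the stability warning above):

import numpy as np

def independent(vectors):
    # True if the given m vectors of dimension n are linearly independent.
    M = np.array(vectors, dtype=float)  # one row per vector
    return np.linalg.matrix_rank(M) == M.shape[0]

print(independent([[30, 10, 20, 0], [60, 20, 40, 0]]))  # False: row 2 = 2 * row 1
print(independent([[1, 0, 0], [0, 1, 0]]))              # True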
If m < n, you will have to do some operation on them (there are multiple possibilities: Gaussian elimination, orthogonalization, etc.; almost any transformation which can be used for solving equations will do) and check the result (e.g. Gaussian elimination => zero row or column, orthogonalization => zero vector, SVD => zero singular value).
However, note that this question is a bad question for a programmer to ask, and this problem is a bad problem for a program to solve. That's because every linearly dependent set of m <= n vectors has a linearly independent set arbitrarily nearby (i.e. the problem is numerically unstable).
I have been working on this problem these days.
Previously, I found some algorithms for Gaussian or Gauss-Jordan elimination, but most of those algorithms only apply to square matrices, not general matrices.
To apply for general matrix, one of the best answers might be this:
http://rosettacode.org/wiki/Reduced_row_echelon_form#MATLAB
You can find both pseudo-code and source code in various languages.
As for me, I translated the Python source code to C++, because the C++ code provided in the above link is somewhat complex and inappropriate to implement in my simulation.
Hope this will help you, and good luck ^^
If computing power is not a problem, probably the best way is to find the singular values of the matrix. Basically you need to find the eigenvalues of M'*M and look at the ratio of the largest to the smallest (the condition number). If the ratio is not very big, the vectors are independent.
Another way to check that m row vectors are linearly independent, when put in a matrix M of size mxn, is to compute
det(M * M^T)
i.e. the determinant of an mxm square matrix. It will be zero if and only if M has some dependent rows. However, Gaussian elimination should in general be faster.
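A small NumPy check of this Gram-determinant idea, for example:

import numpy as np

M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])  # two independent 3-dimensional row vectors
gram = M @ M.T                   # the 2x2 Gram matrix M * M^T
print(np.linalg.det(gram))       # 6.0, nonzero -> the rows are independent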
Sorry man, my mistake...
The source code provided in the above link turns out to be incorrect; at least the Python code I tested and the C++ code I transformed do not generate the right answer all the time (while for the example in the above link, the result is correct :) -- ).
To test the python code, simply replace the mtx with
[30,10,20,0],[60,20,40,0]
and the returned result would be like:
[1,0,0,0],[0,1,2,0]
which cannot be right: the second row is twice the first, so the reduced row echelon form must contain a zero row.
Nevertheless, I have got a way out of this. This time I translated the Matlab source code of the rref function to C++. You can run Matlab and use the type rref command to get the source code of rref.
Just notice that if you are working with some really large or really small values, make sure to use the long double datatype in C++. Otherwise, the result will be truncated and inconsistent with the Matlab result.
I have been conducting large simulations in ns2, and all the observed results are sound.
Hope this will help you and anyone else who has encountered this problem...
A very simple way, though not the most computationally efficient, is to simply remove random rows until m = n and then apply the determinant trick.
m < n: remove rows (make the vectors shorter) until the matrix is square, and then
m = n: check if the determinant is 0 (as you said).
Note that a nonzero determinant proves independence, but a zero determinant of one such square submatrix is inconclusive: independent vectors can still yield a singular submatrix, so you may have to try other subsets of rows.
m > n (the number of vectors is greater than their length): they are linearly dependent (always).
The reason, in short, is that you're trying to solve Av = 0, which then has more unknowns than equations, so a nontrivial solution always exists. For a better explanation, see Wikipedia, which explains it better than I can.
