I'm looking for a mixing function that given an integer from an interval <0, n) returns a random-looking integer from the same interval. The interval size n will typically be a composite non power of 2 number. I need the function to be one to one. It can only use O(1) memory, O(1) time is strongly preferred. I'm not too concerned about randomness of the output, but visually it should look random enough (see next paragraph).
I want to use this function as a pixel shuffling step in a realtime-ish renderer to select the order in which pixels are rendered (The output will be displayed after a fixed time and if it's not done yet this gives me a noisy but fast partial preview). Interval size n will be the number of pixels in the render (n = 1920*1080 = 2073600 would be a typical value). The function must be one to one so that I can be sure that every pixel is rendered exactly once when finished.
I've looked at the reversible building blocks used by hash prospector, but these are mostly specific to power of 2 ranges.
The only other method I could think of is multiply by large prime, but it doesn't give particularly nice random looking outputs.
What are some other options here?
Here is one solution based on the idea of primitive roots modulo a prime:
If a is a primitive root mod p then the function g(i) = a^i % p is a permutation of the nonzero elements which are less than p. This corresponds to the Lehmer prng. If n < p, you can get a permutation of 0, ..., n-1 as follows: Given i in that range, first add 1, then repeatedly multiply by a, taking the result mod p, until you get an element which is <= n, at which point you return the result - 1.
To fill in the details, this paper contains a table which gives a series of primes (all of which are close to various powers of 2) and corresponding primitive roots which are chosen so that they yield a generator with good statistical properties. Here is a part of that table, encoded as a Python dictionary in which the keys are the primes and the primitive roots are the values:
d = {32749: 30805,
     65521: 32236,
     131071: 66284,
     262139: 166972,
     524287: 358899,
     1048573: 444362,
     2097143: 1372180,
     4194301: 1406151,
     8388593: 5169235,
     16777213: 9726917,
     33554393: 32544832,
     67108859: 11526618,
     134217689: 70391260,
     268435399: 150873839,
     536870909: 219118189,
     1073741789: 599290962}
Given n (in a certain range -- see the paper if you need to expand that range), you can find the smallest p which works:
def find_p_a(n):
    for p in sorted(d.keys()):
        if n < p:
            return p, d[p]
Once you know n and the matching p and a, the following function is a permutation of 0 ... n-1:
def f(i,n,p,a):
    x = a*(i+1) % p
    while x > n:
        x = a*x % p
    return x-1
For a quick test:
n = 2073600
p,a = find_p_a(n) # p = 2097143, a = 1372180
nums = [f(i,n,p,a) for i in range(n)]
print(len(set(nums)) == n) #prints True
The average number of multiplications in f() is about p/n, which in this case is 1.011 and will never be more than 2 (or very slightly larger, since the p are not exact powers of 2). In practice this method is not fundamentally different from your "multiply by a large prime" approach, but here the factor is chosen more carefully, and the fact that sometimes more than one multiplication is required adds to the apparent randomness.
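To make the p/n estimate concrete, here is a small sketch that counts the multiplications over one full pass. It uses the smallest table entry (p = 32749, a = 30805) with an arbitrarily chosen n = 30000 so that it runs quickly; the helper name f_counted is made up for this illustration and is not part of the code above.

```python
def f_counted(i, n, p, a):
    """Same walk as f(), but also report how many multiplications were used."""
    x = a * (i + 1) % p
    steps = 1
    while x > n:          # skip values that fall outside 1..n
        x = a * x % p
        steps += 1
    return x - 1, steps

n, p, a = 30000, 32749, 30805   # n is an arbitrary test size below p
results = [f_counted(i, n, p, a) for i in range(n)]
outputs = [r[0] for r in results]
total_steps = sum(r[1] for r in results)

print(sorted(outputs) == list(range(n)))   # is it one-to-one onto 0..n-1?
print(total_steps / n, p / n)              # measured average vs. p/n
```

Every value in 1..p-1 is produced at most once during the pass, so the total number of multiplications cannot exceed p-1, which is where the p/n average comes from.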
I don't know why this is so hard for me to figure out.
For example, I have two functions, f(x, y) and g(x, y). I want to find the values of x and y such that:
f(x, y) is at a target value (minimize the difference from the target)
g(x, y) is minimized (can be negative, doesn't stop at 0)
x and y are bounded (so g's minimum doesn't necessarily have a gradient of 0)
So if I were just finding a solution for f, I could minimize abs(f(x, y) - target), for instance, and it will hit zero when it's found a solution. But there are multiple such solutions and I also want to find the one that minimizes g.
So how do I combine these two functions into a single expression that I can then minimize (using a Newton-like method)?
My first attempt was 100*abs(f(x, y) - target) + g(x, y) to strongly emphasize hitting the target first, and it worked for some cases of the target value, but failed for others, since g(x, y) could go so negative that it dominated the combination and the optimizer stopped caring about f. How do I guarantee that f hitting the target is always dominant?
Are there general rules for how to combine multiple objectives into a single objective?
There is a rich literature about multi-objective optimization. Two popular methods are weighted objective and a lexicographic approach.
A weighted objective could be designed as:
min w1 * [f-target]^2 + w2 * g
for some weights w1, w2 >= 0. Often we have w1+w2=1 so we can also write:
min w1 * [f-target]^2 + (1-w1) * g
Set w1 to a larger value than w2 to put emphasis on the f objective.
The lexicographic method assumes an ordering of objectives. It can look like:
Solve with first objective z = min [f-target]^2. Let z* be the optimal objective.
Solve with the second objective while staying close to z*:
min g subject to [f-target]^2-z* <= tolerance
To measure the deviation between the target and f I used a quadratic function here. You can also use an absolute value.
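The two lexicographic stages can be sketched as follows. This is only an illustration under loud assumptions: f and g are made-up stand-ins, the bounds are taken to be 0 <= x, y <= 1, and a plain grid search replaces the Newton-like optimizer from the question so the example stays self-contained.

```python
from itertools import product

# Hypothetical stand-ins for the real objectives, bounded to 0 <= x, y <= 1.
def f(x, y):
    return x + y

def g(x, y):
    return x - 2 * y

target = 1.0
grid = [i / 100 for i in range(101)]
points = list(product(grid, grid))

def miss(pt):
    """Squared deviation of f from the target at the point pt = (x, y)."""
    return (f(*pt) - target) ** 2

# Stage 1: best achievable deviation from the target.
z_star = min(miss(pt) for pt in points)

# Stage 2: minimize g among the points that stay within tolerance of z_star.
tolerance = 1e-9
feasible = [pt for pt in points if miss(pt) - z_star <= tolerance]
best = min(feasible, key=lambda pt: g(*pt))
print(best, f(*best), g(*best))
```

Because g is only compared among points that already hit the target, it can become arbitrarily negative without ever overriding the first objective.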
Since you cannot exactly get f(x,y)-target to be zero, you have to accept some amount of error. I will use the relative error r = abs((f(x, y) - target)/target).
A function which grows extremely rapidly with r should do the trick.
exp(r/epsilon) + g(x, y)
If I choose epsilon = 1e-10, then I know r has to be less than 1e-7, because exp(1000) is an enormous number. But when r is small, like r = 1e-12, the exponential changes very slowly, and g(x,y) will be the dominant term. You can even take it a step further and calculate how close x and y are to their true value, but it's usually just easier to adjust the parameter until you get what you need.
I am trying to find all roots [f(x) = 0] of a function. My current solution only works if they are spaced out enough and don't interfere with each other. (e.g. it works for x^2 - 2)
bool numberIsCloseToZero(num number){
  return (num.parse(number.abs().toStringAsFixed(1)) == 0.0) ? true : false;
}
List<num> calculateRoots(String function){
  num eval = 0.0;
  List<num> roots = [];
  for (num x = -10; x < 10; x += 0.1){
    eval = calculateYOfX(function, x);
    if (numberIsCloseToZero(num.parse(eval.toStringAsFixed(2)))){
      roots.add(x);
    }
  }
  return roots;
}
Obviously, this is due to my rounding. (e.g. the surrounding values of the root of x^2 are too close to zero, so it assumes they are roots as well). Do you think I should go through actually solving the equation instead of "brute forcing" the roots?
Thanks
If you can find an analytical solution, use it. That is possible for low-degree polynomial equations (like the x^2 - 2 you mentioned).
In the general case you will have to learn numerical methods, here specifically root finding.
Start with the bisection method or Newton's method. Each step gives a more precise position of the root.
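As a rough sketch of the bisection method (in Python rather than Dart, and with the made-up name bisect_root), assuming you already know an interval on which the function changes sign:

```python
def bisect_root(f, lo, hi, tol=1e-12):
    """Shrink [lo, hi] around a sign change of f until it is narrower than tol."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_mid == 0:
            return mid
        if f_lo * f_mid < 0:
            hi = mid               # the sign change is in the left half
        else:
            lo, f_lo = mid, f_mid  # the sign change is in the right half
    return (lo + hi) / 2

root = bisect_root(lambda x: x * x - 2, 0.0, 2.0)
print(root)
```

Bisection finds one root per bracketing interval, so it still needs a separate step that locates sign changes; it will also miss roots where the function only touches zero, such as x^2 at 0.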
You'll need to put some restrictions on what is an allowable function, otherwise you have no hope.
For example, without any restrictions you have no guarantee that there is only a finite number of zeros (consider f(x)=sin(x) ), or even a finite number of zeros in a given interval (consider f(x)=x sin(1/x) ). There may even be a continuum of connected zeros ( f(x) = max(0,x) ).
And these cases are not even considered particularly pathological mathematical functions.
If you're willing to go down the path of requiring your function to be non-zero almost-everywhere, smooth, continuous and with bounded first and second derivatives then I think you may be able to come up with a relatively simple algorithm that guarantees you get all zeros in a given finite region.
(I'd look for a subdivision based algorithm which recursively splits the region and determines strict bounds on each interval.)
We can derive an example algorithm for when the derivative is bounded by a known constant i.e. |f'(x)| < D. Note that if we evaluate f at some point p then for any other point p+d we can show that f(p) - |d| D < f(p+d) < f(p) + |d| D.
Using this we can consider root finding in an interval [A,B] - which we can write as [p-d, p+d] where p=(A+B)/2, d=(B-A)/2. Sample f at the mid-point to get f(p). The minimum value f could take on the interval is f(p) - d D and the maximum value is f(p) + d D. We can only have a root in this interval if f(p) - d D <= 0 <= f(p) + d D, which is equivalent to |f(p)| <= d D.
If there can be no root in [A,B] we're done, otherwise we repeat on the two halves [A,p] and [p,B]. (some care needs to be taken in the case f(p)=0 )
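The subdivision described above might look like the following sketch (Python, with made-up function names). It assumes the caller supplies a valid derivative bound D; here D = 21 is used for x^2 - 2 on [-10, 10], where |f'(x)| never exceeds 20. Intervals that cannot be ruled out are split until they are narrower than tol, and touching survivors are merged into one bracket per root.

```python
def root_intervals(f, a, b, D, tol=1e-6):
    """Intervals narrower than tol in [a, b] that the bound |f'| < D cannot rule out."""
    pending = [(a, b)]
    kept = []
    while pending:
        lo, hi = pending.pop()
        p = (lo + hi) / 2
        d = (hi - lo) / 2
        if abs(f(p)) > d * D:
            continue                      # no root possible in [lo, hi]
        if hi - lo < tol:
            kept.append((lo, hi))
        else:
            pending += [(lo, p), (p, hi)]
    return sorted(kept)

def merge_touching(intervals):
    """Join sorted intervals that share an endpoint into one bracket per cluster."""
    merged = []
    for lo, hi in intervals:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(hi, merged[-1][1]))
        else:
            merged.append((lo, hi))
    return merged

brackets = merge_touching(root_intervals(lambda x: x * x - 2, -10.0, 10.0, D=21.0))
print(brackets)
```

Each bracket is guaranteed to contain every root in its neighbourhood, but a bracket can contain more than one root or, for a near-miss of zero, none at all; a tighter D or a smaller tol narrows the brackets.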
Consider the set of non-decreasing surjective (onto) functions from (-inf,inf) to [0,1].
(Typical CDFs satisfy this property.)
In other words, for any real number x, 0 <= f(x) <= 1.
The logistic function is perhaps the most well-known example.
We are now given some constraints in the form of a list of x-values and for each x-value, a pair of y-values that the function must lie between.
We can represent that as a list of {x,ymin,ymax} triples such as
constraints = {{0, 0, 0}, {1, 0.00311936, 0.00416369}, {2, 0.0847077, 0.109064},
{3, 0.272142, 0.354692}, {4, 0.53198, 0.646113}, {5, 0.623413, 0.743102},
{6, 0.744714, 0.905966}}
Graphically that looks like this:
(source: yootles.com)
We now seek a curve that respects those constraints.
For example:
(source: yootles.com)
Let's first try a simple interpolation through the midpoints of the constraints:
mids = ({#1, Mean[{#2,#3}]}&) @@@ constraints
f = Interpolation[mids, InterpolationOrder->0]
Plotted, f looks like this:
(source: yootles.com)
That function is not surjective. Also, we'd like it to be smoother.
We can increase the interpolation order but now it violates the constraint that its range is [0,1]:
(source: yootles.com)
The goal, then, is to find the smoothest function that satisfies the constraints:
Non-decreasing.
Tends to 0 as x approaches negative infinity and tends to 1 as x approaches infinity.
Passes through a given list of y-error-bars.
The first example I plotted above seems to be a good candidate but I did that with Mathematica's FindFit function assuming a lognormal CDF.
That works well in this specific example but in general there need not be a lognormal CDF that satisfies the constraints.
I don't think you've specified enough criteria to make the desired CDF unique.
If the only criteria that must hold are:
CDF must be "fairly smooth" (see below)
CDF must be non-decreasing
CDF must pass through the "error bar" y-intervals
CDF must tend toward 0 as x --> -Infinity
CDF must tend toward 1 as x --> Infinity.
then perhaps you could use Monotone Cubic Interpolation.
This will give you a C^1 (continuously differentiable) function which,
unlike cubic splines, is guaranteed to be monotone when given monotone data.
This leaves open the question, exactly what data should you use to generate the
monotone cubic interpolation. If you take the center point (mean) of each error
bar, are you guaranteed that the resulting data points are monotonically
increasing? If not, you might as well make some arbitrary choice to guarantee
that the points you select are monotonically increasing (because the criteria do not force our solution to be unique).
Now what to do about the last data point? Is there an X which is guaranteed to
be larger than any x in the constraints data set? Perhaps you can again make an
arbitrary choice of convenience and pick some very large X and put (X,1) as the
final data point.
Comment 1: Your problem can be broken into 2 sub-problems:
Given exact points (x_i,y_i) through which the CDF must pass, how do you generate CDF? I suspect there are infinitely many possible solutions, even with the infinite-smoothness constraint.
Given y-errorbars, how should you pick (x_i,y_i)? Again, there are infinitely many possible solutions. Some additional criteria may need to be added to force a unique choice. Additional criteria would also probably make the problem even harder than it currently is.
Comment 2: Here is a way to use monotonic cubic interpolation, and satisfy criteria 4 and 5:
The monotonic cubic interpolation (let's call it f) maps R --> R.
Let CDF(x) = exp(-exp(f(x))). Then CDF: R --> (0,1). If we could find the appropriate f, then by defining CDF this way, we could satisfy criteria 4 and 5.
To find f, transform the CDF constraints (x_0,y_0),...,(x_n,y_n) using the transformation xhat_i = x_i, yhat_i = log(-log(y_i)). This is the inverse of the CDF transformation. If the y_i's were increasing, then the yhat_i's are decreasing.
Now apply monotone cubic interpolation to the (x_hat,y_hat) data points to generate f. Then finally, define CDF(x) = exp(-exp(f(x))). This will be a monotonically increasing function from R --> (0,1), which passes through the points (x_i,y_i).
This, I think, satisfies criteria 2--5. Criterion 1 is somewhat satisfied, though there certainly could exist smoother solutions.
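Here is a rough Python sketch of the transformation in Comment 2. To stay dependency-free it swaps in piecewise-linear interpolation for the monotone cubic step, so it shows the log(-log(y)) trick rather than the smoothness. It uses the midpoints of the constraints from the question and leaves out the x = 0 constraint, because y = 0 maps to infinity under the transform; the 700 cap in cdf is only an overflow guard.

```python
import math
from bisect import bisect_right

# {x, ymin, ymax} triples from the question; x = 0 is left out because
# y = 0 would map to +infinity under log(-log(y)).
constraints = [(1, 0.00311936, 0.00416369), (2, 0.0847077, 0.109064),
               (3, 0.272142, 0.354692), (4, 0.53198, 0.646113),
               (5, 0.623413, 0.743102), (6, 0.744714, 0.905966)]
xs = [x for x, _, _ in constraints]
ys = [(lo + hi) / 2 for _, lo, hi in constraints]
yhat = [math.log(-math.log(y)) for y in ys]   # decreasing when ys is increasing

def f(x):
    """Piecewise-linear stand-in for the monotone cubic; end segments extend outward."""
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return yhat[i] + t * (yhat[i + 1] - yhat[i])

def cdf(x):
    # Cap the inner exponent so math.exp cannot overflow far to the left.
    return math.exp(-math.exp(min(f(x), 700.0)))

print([round(cdf(x), 6) for x in xs])
```

Because f decreases without bound to the right and increases without bound to the left, cdf tends to 1 and 0 respectively while still passing through the chosen points.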
I have found a solution that gives reasonable results for a variety of inputs.
I start by fitting a model -- once to the low ends of the constraints, and again to the high ends.
I'll refer to the mean of these two fitted functions as the "ideal function".
I use this ideal function to extrapolate to the left and to the right of where the constraints end, as well as to interpolate between any gaps in the constraints.
I compute values for the ideal function at regular intervals, including all the constraints, from where the function is nearly zero on the left to where it's nearly one on the right.
At the constraints, I clip these values as necessary to satisfy the constraints.
Finally, I construct an interpolating function that goes through these values.
My Mathematica implementation follows.
First, a couple helper functions:
(* Distance from x to the nearest member of list l. *)
listdist[x_, l_List] := Min[Abs[x - #] & /@ l]
(* Return a value x for the variable var such that expr/.var->x is at least (or
   at most, if dir is -1) t. *)
invertish[expr_, var_, t_, dir_:1] := Module[{x = dir},
  While[dir*(expr /. var -> x) < dir*t, x *= 2];
  x]
And here's the main function:
(* Return a non-decreasing interpolating function that maps from the
   reals to [0,1] and that is as close as possible to expr[var] without
   violating the given constraints (a list of {x,ymin,ymax} triples).
   The model, expr, will have free parameters, params, so first do a
   model fit to choose the parameters to satisfy the constraints as well
   as possible. *)
cfit[constraints_, expr_, params_, var_] :=
Block[{xlist,bots,tops,loparams,hiparams,lofit,hifit,xmin,xmax,gap,aug,bests},
  xlist = First /@ constraints;
  bots = Most /@ constraints; (* bottom points of the constraints *)
  tops = constraints /. {x_, _, ymax_} -> {x, ymax};
  (* fit a model to the lower bounds of the constraints, and
     to the upper bounds *)
  loparams = FindFit[bots, expr, params, var];
  hiparams = FindFit[tops, expr, params, var];
  lofit[z_] = (expr /. loparams /. var -> z);
  hifit[z_] = (expr /. hiparams /. var -> z);
  (* find x-values where the fitted function is very close to 0 and to 1 *)
  {xmin, xmax} = {
    Min@Append[xlist, invertish[expr /. hiparams, var, 10^-6, -1]],
    Max@Append[xlist, invertish[expr /. loparams, var, 1-10^-6]]};
  (* the smallest gap between x-values in constraints *)
  gap = Min[(#2 - #1 &) @@@ Partition[Sort[xlist], 2, 1]];
  (* augment the constraints to fill in any gaps and extrapolate so there are
     constraints everywhere from where the function is almost 0 to where it's
     almost 1 *)
  aug = SortBy[Join[constraints, Select[Table[{x, lofit[x], hifit[x]},
                                              {x, xmin,xmax, gap}],
                                        listdist[#[[1]],xlist]>gap&]], First];
  (* pick a y-value from each constraint that is as close as possible to
     the mean of lofit and hifit *)
  bests = ({#1, Clip[(lofit[#1] + hifit[#1])/2, {#2, #3}]} &) @@@ aug;
  Interpolation[bests, InterpolationOrder -> 3]]
For example, we can fit to a lognormal, normal, or logistic function:
g1 = cfit[constraints, CDF[LogNormalDistribution[mu,sigma], z], {mu,sigma}, z]
g2 = cfit[constraints, CDF[NormalDistribution[mu,sigma], z], {mu,sigma}, z]
g3 = cfit[constraints, 1/(1 + c*Exp[-k*z]), {c,k}, z]
Here's what those look like for my original list of example constraints:
(source: yootles.com)
The normal and logistic are nearly on top of each other and the lognormal is the blue curve.
These are not quite perfect.
In particular, they aren't quite monotone.
Here's a plot of the derivatives:
Plot[{g1'[x], g2'[x], g3'[x]}, {x, 0, 10}]
(source: yootles.com)
That reveals some lack of smoothness as well as the slight non-monotonicity near zero.
I welcome improvements on this solution!
You can try to fit a Bezier curve through the midpoints. Specifically I think you want a C2 continuous curve.
This question on getting random values from a finite set got me thinking...
It's fairly common for people to want to retrieve X unique values from a set of Y values. For example, I may want to deal a hand from a deck of cards. I want 5 cards, and I want them to all be unique.
Now, I can do this naively, by picking a random card 5 times, and try again each time I get a duplicate, until I get 5 cards. This isn't so great, however, for large numbers of values from large sets. If I wanted 999,999 values from a set of 1,000,000, for instance, this method gets very bad.
The question is: how bad? I'm looking for someone to explain an O() value. Getting the xth number will take y attempts...but how many? I know how to figure this out for any given value, but is there a straightforward way to generalize this for the whole series and get an O() value?
(The question is not: "how can I improve this?" because it's relatively easy to fix, and I'm sure it's been covered many times elsewhere.)
Variables
n = the total amount of items in the set
m = the amount of unique values that are to be retrieved from the set of n items
d(i) = the expected amount of tries needed to achieve a value in step i
i = denotes one specific step. i ∈ [0, n-1]
T(m,n) = expected total amount of tries for selecting m unique items from a set of n items using the naive algorithm
Reasoning
The first step, i=0, is trivial. No matter which value we choose, we get a unique one at the first attempt. Hence:
d(0) = 1
In the second step, i=1, we need at least 1 try (the try where we pick a valid unique value). On top of this, there is a chance that we choose the wrong value. This chance is (number of previously picked items)/(total number of items), in this case 1/n. In the case where we picked the wrong item, there is a 1/n chance we may pick the wrong item again. Multiplying this by 1/n, since that is the combined probability that we pick wrong both times, gives (1/n)^2. To understand this, it is helpful to draw a decision tree. Having picked a non-unique item twice, there is a probability that we will do it again. This results in the addition of (1/n)^3 to the total expected number of tries in step i=1. Each time we pick the wrong number, there is a chance we might pick the wrong number again. This results in:
d(1) = 1 + 1/n + (1/n)^2 + (1/n)^3 + (1/n)^4 + ...
Similarly, in the general i:th step, the chance to pick the wrong item in one choice is i/n, resulting in:
d(i) = 1 + i/n + (i/n)^2 + (i/n)^3 + (i/n)^4 + ... = sum( (i/n)^k ), where k ∈ [0,∞)
This is a geometric series and hence it is easy to compute its sum:
d(i) = (1 - i/n)^(-1)
The overall complexity is then computed by summing the expected number of tries in each step:
T(m,n) = sum( d(i) ), where i ∈ [0,m-1]
       = 1 + (1 - 1/n)^(-1) + (1 - 2/n)^(-1) + (1 - 3/n)^(-1) + ... + (1 - (m-1)/n)^(-1)
Extending the fractions in the series above by n, we get:
T(m,n) = n/n + n/(n-1) + n/(n-2) + n/(n-3) + ... + n/(n-m+2) + n/(n-m+1)
We can use the fact that:
n/n ≤ n/(n-1) ≤ n/(n-2) ≤ n/(n-3) ≤ ... ≤ n/(n-m+2) ≤ n/(n-m+1)
Since the series has m terms, and each term satisfies the inequality above, we get:
T(m,n) ≤ n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + ... + n/(n-m+1) + n/(n-m+1) = m*n/(n-m+1)
It might be (and probably is) possible to establish a slightly stricter upper bound by using some technique to evaluate the series instead of bounding it by the rough method of (number of terms) * (biggest term).
Conclusion
This would mean that the Big-O order is O(m*n/(n-m+1)). I see no possible way to simplify this expression from the way it is.
Looking back at the result to check if it makes sense, we see that, if n is constant and m gets closer and closer to n, the result will quickly increase, since the denominator gets very small. This is what we'd expect if we consider the example given in the question about selecting "999,999 values from a set of 1,000,000". If we instead let m be constant and n grow really, really large, the complexity will converge towards O(m) in the limit n → ∞. This is also what we'd expect, since when choosing a constant number of items from a "close to" infinitely sized set the probability of choosing a previously chosen value is basically 0. I.e. we need m tries independently of n since there are no collisions.
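As a sanity check on the derivation, the sketch below compares the exact sum T(m,n), the upper bound m*n/(n-m+1), and a seeded simulation of the naive algorithm. The function names and the test sizes m = 90, n = 100 are arbitrary choices for this illustration.

```python
import random

def expected_tries(m, n):
    """Exact T(m, n): sum of n / (n - i) for i = 0 .. m-1."""
    return sum(n / (n - i) for i in range(m))

def upper_bound(m, n):
    """The rough bound (number of terms) * (biggest term)."""
    return m * n / (n - m + 1)

def simulate(m, n, rng):
    """Count the tries one run of the naive retry-on-duplicate algorithm needs."""
    seen, tries = set(), 0
    while len(seen) < m:
        seen.add(rng.randrange(n))
        tries += 1
    return tries

rng = random.Random(0)
m, n = 90, 100
average = sum(simulate(m, n, rng) for _ in range(2000)) / 2000
print(expected_tries(m, n), upper_bound(m, n), average)
```

The simulated average should land close to the exact sum, while the bound is noticeably looser when m is close to n.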
If you already have chosen i values then the probability that you pick a new one from a set of y values is
(y-i)/y.
Hence the expected number of trials to get the (i+1)-th element is
y/(y-i).
Thus the expected number of trials to choose x unique elements is the sum
y/y + y/(y-1) + ... + y/(y-x+1)
This can be expressed using harmonic numbers as
y (H_y - H_(y-x)).
From the Wikipedia page you get the approximation
H_x = ln(x) + gamma + O(1/x)
Hence the number of necessary trials to pick x unique elements from a set of y elements is
y (ln(y) - ln(y-x)) + O(y/(y-x)).
If you need to, you can get a more precise approximation by using a more precise approximation for H_x. In particular, when x is small it is possible to improve the result a lot.
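A quick numerical comparison of the exact sum with the logarithmic estimate, as a sketch; the function names and the sizes x = 900, y = 1000 are made up for the example.

```python
import math

def exact_trials(x, y):
    """y/y + y/(y-1) + ... + y/(y-x+1), i.e. y * (H_y - H_(y-x))."""
    return sum(y / (y - i) for i in range(x))

def log_estimate(x, y):
    """Leading term y * (ln(y) - ln(y-x)) of the approximation."""
    return y * (math.log(y) - math.log(y - x))

x, y = 900, 1000
print(exact_trials(x, y), log_estimate(x, y), y / (y - x))
```

The gap between the first two numbers stays within the order of the third, which is the O(y/(y-x)) error term.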
If you're willing to make the assumption that your random number generator will always find a unique value before cycling back to a previously seen value for a given draw, this algorithm is O(m^2), where m is the number of unique values you are drawing.
So, if you are drawing m values from a set of n values, the 1st value will require you to draw at most 1 to get a unique value. The 2nd requires at most 2 (you see the 1st value, then a unique value), the 3rd 3, ... the mth m. Hence in total you require 1 + 2 + 3 + ... + m = [m*(m+1)]/2 = (m^2 + m)/2 draws. This is O(m^2).
Without this assumption, I'm not sure how you can even guarantee the algorithm will complete. It's quite possible (especially with a pseudo-random number generator which may have a cycle), that you will keep seeing the same values over and over and never get to another unique value.
==EDIT==
For the average case:
On your first draw, you will make exactly 1 draw.
On your 2nd draw, you expect to make 1 (the successful draw) + 1/n (the "partial" draw which represents your chance of drawing a repeat)
On your 3rd draw, you expect to make 1 (the successful draw) + 2/n (the "partial" draw...)
...
On your mth draw, you expect to make 1 + (m-1)/n draws.
Thus, you will make 1 + (1 + 1/n) + (1 + 2/n) + ... + (1 + (m-1)/n) draws altogether in the average case.
This equals the sum from i=0 to (m-1) of [1 + i/n]. Let's denote that sum(1 + i/n, i, 0, m-1).
Then:
sum(1 + i/n, i, 0, m-1) = sum(1, i, 0, m-1) + sum(i/n, i, 0, m-1)
= m + sum(i/n, i, 0, m-1)
= m + (1/n) * sum(i, i, 0, m-1)
= m + (1/n)*[(m-1)*m]/2
= (m^2)/(2n) - (m)/(2n) + m
We drop the low order terms and the constants, and we get that this is O(m^2/n), where m is the number to be drawn and n is the size of the list.
There's a beautiful O(n) algorithm for this. It goes as follows. Say you have n items, from which you want to pick m items. I assume the function rand() yields a random real number between 0 and 1. Here's the algorithm:
items_left=n
items_left_to_pick=m
for j=1,...,n
    if rand()<=(items_left_to_pick/items_left)
        Pick item j
        items_left_to_pick=items_left_to_pick-1
    end
    items_left=items_left-1
end
It can be proved that this algorithm does indeed pick each subset of m items with equal probability, though the proof is non-obvious. Unfortunately, I don't have a reference handy at the moment.
Edit The advantage of this algorithm is that it takes only O(m) memory (assuming the items are simply integers or can be generated on-the-fly) compared to doing a shuffle, which takes O(n) memory.
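A runnable Python version of the pseudocode above could look like this sketch. The name selection_sample is made up, the items are assumed to be the integers 0..n-1, and the comparison uses < instead of <= because Python's random() returns values in [0, 1).

```python
import random

def selection_sample(n, m, rng=random):
    """Pick m of the items 0..n-1 in one pass, keeping only the picks in memory."""
    picked = []
    items_left_to_pick = m
    for j in range(n):
        items_left = n - j
        # random() is in [0, 1), so '<' always picks when every remaining item
        # is needed and never picks once the quota is filled.
        if rng.random() < items_left_to_pick / items_left:
            picked.append(j)
            items_left_to_pick -= 1
    return picked

print(selection_sample(52, 5))
```

The result always has exactly m items and comes out in increasing order, so shuffle it afterwards if the order of the picks matters.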
Your actual question is actually a lot more interesting than what I answered (and harder). I've never been any good at statistics (and it's been a while since I did any), but intuitively, I'd say that the run-time complexity of that algorithm would probably be something like an exponential. As long as the number of elements picked is small enough compared to the size of the array, the collision rate will be so small that it will be close to linear time, but at some point the number of collisions will probably grow fast and the run-time will go down the drain.
If you want to prove this, I think you'd have to do something moderately clever with the expected number of collisions as a function of the wanted number of elements. It might be possible to do by induction as well, but I think going by that route would require more cleverness than the first alternative.
EDIT: After giving it some thought, here's my attempt:
Given an array of m elements, and looking for n random and different elements. It is then easy to see that when we want to pick the ith element, the odds of picking an element we've already visited are (i-1)/m. This is then the expected number of collisions for that particular pick. For picking n elements, the expected number of collisions will be the sum of the number of expected collisions for each pick. We plug this into Wolfram Alpha (sum (i-1)/m, i=1 to n) and we get the answer (n**2 - n)/2m. The average number of picks for our naive algorithm is then n + (n**2 - n)/2m.
Unless my memory fails me completely (which is entirely possible, actually), this gives an average-case run-time of O(n**2).
The worst case for this algorithm is clearly when you're choosing the full set of N items. This is equivalent to asking: On average, how many times must I roll an N-sided die before each side has come up at least once?
Answer: N * H_N, where H_N is the Nth harmonic number,
a value famously approximated by log(N).
This means the algorithm in question is O(N log N).
As a fun example, if you roll an ordinary 6-sided die until you see one of each number, it will take on average 6 * H_6 = 14.7 rolls.
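The die example is easy to check with exact fractions; expected_rolls is a made-up helper name for this sketch.

```python
from fractions import Fraction

def expected_rolls(sides):
    """N * H_N: expected rolls of an N-sided die until every side has appeared."""
    harmonic = sum(Fraction(1, k) for k in range(1, sides + 1))
    return sides * harmonic

print(float(expected_rolls(6)))   # 14.7
```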
Before answering this question in detail, let's define the framework. Suppose you have a collection {a1, a2, ..., an} of n distinct objects, and want to pick m distinct objects from this set, such that the probability of a given object aj appearing in the result is equal for all objects.
If you have already picked k items, and randomly pick an item from the full set {a1, a2, ..., an}, the probability that the item has not been picked before is (n-k)/n. This means that the number of samples you have to take before you get a new object is (assuming independence of random sampling) geometric with parameter (n-k)/n. Thus the expected number of samples to obtain one extra item is n/(n-k), which is close to 1 if k is small compared to n.
Concluding, if you need m unique objects, randomly selected, this algorithm gives you
n/n + n/(n-1) + n/(n-2) + n/(n-3) + .... + n/(n-(m-1))
which, as Alderath showed, can be estimated by
m*n / (n-m+1).
You can see a little bit more from this formula:
* The expected number of samples to obtain a new unique element increases as the number of already chosen objects increases (which sounds logical).
* You can expect really long computation times when m is close to n, especially if n is large.
In order to obtain m unique members from the set, use a variant of Donald Knuth's algorithm for obtaining a random permutation. Here, I'll assume that the n objects are stored in an array.
for i = 1..m
    k = randInt(i, n)
    exchange(i, k)
end
Here, randInt samples an integer from {i, i+1, ... n}, and exchange swaps two members of the array. You only need to shuffle m times, so the computation time is O(m), whereas the memory is O(n) (although you can adapt it to only save the entries such that a[i] <> i, which would give you O(m) on both time and memory, but with higher constants).
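In Python, the partial shuffle above could be sketched like this, with 0-based indices. The name pick_unique is made up, and the sketch copies the input first so the caller's data is not reordered, which costs O(n) up front; shuffle in place if you need the O(m) running time.

```python
import random

def pick_unique(items, m, rng=random):
    """Return m distinct elements of items using the first m swaps of a shuffle."""
    pool = list(items)               # copy so the caller's sequence is untouched
    n = len(pool)
    for i in range(m):
        k = rng.randint(i, n - 1)    # inclusive on both ends, 0-based indices
        pool[i], pool[k] = pool[k], pool[i]
    return pool[:m]

hand = pick_unique(range(52), 5)
print(hand)
```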
Most people forget that looking up whether a number has already been drawn also takes a while.
The number of tries necessary can, as described earlier, be evaluated from:
T(n,m) = n(H(n)-H(n-m)) ⪅ n(ln(n)-ln(n-m))
which goes to n*ln(n) for interesting values of m
However, for each of these 'tries' you will have to do a lookup. This might be a simple O(n) runthrough, or something like a binary tree. This will give you a total performance of n^2*ln(n) or n*ln(n)^2.
For smaller values of m (m < n/2), you can get a very good approximation for T(n,m) using the harmonic-arithmetic mean (HA) inequality, yielding the formula:
2*m*n/(2*n-m+1)
As m goes to n, this gives a lower bound of O(n) tries and performance O(n^2) or O(n*ln(n)).
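To get a feel for how tight this estimate is, the sketch below compares it with the exact sum for one case with m < n/2. The function names and the sizes n = 1000, m = 400 are arbitrary choices for the example.

```python
def exact_tries(n, m):
    """T(n, m) = n/n + n/(n-1) + ... + n/(n-m+1)."""
    return sum(n / (n - i) for i in range(m))

def ha_estimate(n, m):
    """Lower bound 2*m*n/(2*n-m+1) from the harmonic-arithmetic mean inequality."""
    return 2 * m * n / (2 * n - m + 1)

n, m = 1000, 400
print(exact_tries(n, m), ha_estimate(n, m))
```

The estimate always sits at or below the exact value, and for this case the two differ by only a few percent.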
All the results are, however, far better than I would ever have expected, which shows that the algorithm might actually be just fine in many non-critical cases, where you can accept occasional longer running times (when you are unlucky).