Take a UUID in its hex representation: '123e4567-e89b-12d3-a456-426655440000'
I have a lot of such UUIDs, and I want to separate them into N buckets, where N is of my choosing, and I want to generate the bounds of these buckets.
I can trivially create 16 buckets with these bounds:
00000000-0000-0000-0000-000000000000
10000000-0000-0000-0000-000000000000
20000000-0000-0000-0000-000000000000
30000000-0000-0000-0000-000000000000
...
e0000000-0000-0000-0000-000000000000
f0000000-0000-0000-0000-000000000000
ffffffff-ffff-ffff-ffff-ffffffffffff
just by iterating over the options for the first hex digit.
Suppose I want 50 equal-size buckets (equal in terms of the number of UUID possibilities contained within each bucket), or 2000 buckets, or N buckets.
How do I generate such bounds as a function of N?
Your UUIDs above are 32 hex digits in length. So that means you have 16^32 ≈ 3.4e38 possible UUIDs. A simple solution would be to use a big int library (or a method of your own) to store these very large values as actual numbers. Then, you can just divide the number of possible UUIDs by N (call that value k), giving you bucket bounds of 0, k, 2*k, ... (N-1)*k, UMAX.
This runs into a problem if N doesn't divide the number of possible UUIDs. Obviously, not every bucket will have the same number of UUIDs, but in this case, they won't even be evenly distributed. For example, if the number of possible UUIDs is 32, and you want 7 buckets, then k would be 4, so you would have buckets of size 4, 4, 4, 4, 4, 4, and 8. This probably isn't ideal. To fix this, you could instead make the bucket bounds at 0, (1*UMAX)/N, (2*UMAX)/N, ... ((N-1)*UMAX)/N, UMAX. Then, in the inconvenient case above, you would end up with bounds at 0, 4, 9, 13, 18, 22, 27, 32 -- giving bucket sizes of 4, 5, 4, 5, 4, 5, 5.
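As a rough sketch of the second scheme: Python's built-in integers are arbitrary precision, so no separate big int library is needed there. The names `integer_bounds` and `uuid_bounds` are made up for illustration, and clamping the final bound to the largest UUID (so that it can still be printed as one) is an assumption of this sketch.

```python
import uuid

UUID_SPACE = 1 << 128  # 16**32 possible values


def integer_bounds(space, n):
    """Return the n + 1 bounds 0, space/n, 2*space/n, ..., space (floored)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [(i * space) // n for i in range(n + 1)]


def uuid_bounds(n):
    """Same bounds over the full UUID space, formatted as UUID strings."""
    top = UUID_SPACE - 1  # largest value that is still a valid UUID
    return [str(uuid.UUID(int=min(b, top))) for b in integer_bounds(UUID_SPACE, n)]
```

Here `integer_bounds(32, 7)` gives `[0, 4, 9, 13, 18, 22, 27, 32]`, matching the small example above, and `uuid_bounds(16)` reproduces the 16-bucket bounds from the question.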
You will probably need a big int library or some other method to store large integers in order to use this method. For comparison, an unsigned long long in C++ (in typical implementations) can only store values up to 2^64 - 1 ≈ 1.8e19.
If N is a power of 2, then the solution is obvious: you can split on bit boundaries as for 16 buckets in your question.
If N is not a power of 2, the buckets mathematically cannot be of exactly equal size, so the question becomes how much inequality you are willing to tolerate in the name of efficiency.
As long as N<2^24 or so, the simplest thing to do is just allocate UUIDs based on the first 32 bits into N buckets each of size 2^32/N. That should be fast enough and equal enough for most applications, and if N needs to be larger than that allows, you could easily double the bits with a smallish penalty.
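One way to read this suggestion, sketched in Python: take the top 32 bits of the UUID and scale them into the range 0..N-1 with a multiply and a shift. The function name `bucket_index` is hypothetical, and the multiply-and-shift mapping is just one possible way to split the 32-bit space into N nearly equal pieces.

```python
import uuid


def bucket_index(uuid_string, n):
    """Map a UUID to a bucket in 0..n-1 using only its first 32 bits."""
    top32 = uuid.UUID(uuid_string).int >> 96  # keep the 32 most significant bits
    return (top32 * n) >> 32                  # floor(top32 * n / 2**32)
```

With n = 16 this agrees with splitting on the first hex digit: `bucket_index('123e4567-e89b-12d3-a456-426655440000', 16)` is 1.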
There is a method contains that can be used to check if a particular element exists in a Vec. How to check if all elements from a Vec are contained in another Vec? Is there something more concise than iterating manually and checking all elements explicitly?
You have two main choices:
naively check each element from one vector to see if it's in the other. This has time complexity O(n^2) but it's also very simple and has low overhead:
assert!(b.iter().all(|item| a.contains(item)));
create a set of all the elements of one of the vectors, and then check whether the elements of the other are contained in it. This has O(n) time complexity, but higher overhead, including an extra heap allocation:
let a_set: HashSet<_> = a.iter().copied().collect();
assert!(b.iter().all(|item| a_set.contains(item)));
Which one is "better" will depend on your requirements. If you only care about speed, the better choice will still depend on the number of elements in your vectors, so you should test both with realistic data. You could also test with a BTreeSet, which has different performance characteristics from HashSet.
Here are some rough benchmarks (source) for how the implementations vary with the size of the input. In all tests, b is half the size of a and contains a random subset of a's elements:
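| Size of a | Vec::contains | HashSet::contains | BTreeSet::contains |
|-----------|---------------|-------------------|--------------------|
| 10        | 14            | 386               | 327                |
| 100       | 1,754         | 3,187             | 5,371              |
| 1000      | 112,306       | 31,233            | 88,340             |
| 10000     | 2,821,867     | 254,801           | 728,268            |
| 100000    | 29,207,999    | 2,645,703         | 6,611,666          |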
Times in nanoseconds.
The naive O(n^2) solution is fastest when the number of elements is small. The overhead of allocating a HashSet or BTreeSet is overshadowed by the impact of the number of comparisons when the size is more than about 200. BTreeSet is mostly a lot slower than HashSet, but is slightly faster when the number of elements is very small.
If you have sorted vectors, you can do the search in linear time:
let mut vec = vec![0, 2, 4, 3, 6, 3, 5, 1, 0];
let mut v = vec![1, 4, 3, 3, 1];
vec.sort_unstable();
v.sort_unstable();
// Remove duplicate elements in v
v.dedup();
let mut vec_iter = vec.iter();
assert!(v.iter().all(|&x| vec_iter.any(|&item| item == x)));
Reference: C++ has std::includes which does exactly that.
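You could also sort the vectors and then test them for equality. Note that this checks that both vectors hold exactly the same elements, which is stricter than checking that one is contained in the other:
fn main() {
    let mut v1 = vec![2, 3, 1];
    let mut v2 = vec![3, 1, 2];
    v1.sort();
    v2.sort();
    assert_eq!(v1, v2);
}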
This is a bit of a math question, but I post it here too because there's a direct practical purpose and it's related to creating a faster algorithm. I want to identify users that use my app on a weekly basis. For each user I can generate a sequence of times of their interactions, and from that I can generate a sequence of the length of time between each interaction.
So given this sequence of lengths of time, how can I find sections of consecutive numbers that have an average of 7 days or less?
As an example, if I had the following sequence: [1, 11, 1, 8, 12]
[1, 11, 1, 8, 12] would be a valid stretch of numbers with an average of 7 or less (its mean is 6.6), but [11, 1, 8, 12] would not be valid (its mean is 8). [1, 8, 12] would again be valid (its mean is exactly 7).
Ideally, my output for every valid section would be the starting position of the first item and the length of the section. So [1, 11, 1, 8, 12] would be described as [1, 5] and [1, 8, 12] would be described as [3, 3].
There is a brute-force approach where I take every item in the sequence as a starting point and calculate the average of every possible length of following numbers up to the end of the sequence. The number of calculations grows quickly, though, at a rate of n(n+1)/2 for a sequence of length n (imagine finding the consecutive sections of length n, n-1, n-2, and so on).
I ask broadly if there's a more elegant approach that doesn't require a quadratically growing number of individual calculations for means.
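For reference, the brute-force enumeration described above might look like the following Python sketch. It is still quadratic in the number of sections examined, but it keeps a running sum so that no mean is recomputed from scratch. The function name `valid_sections` and the 1-based (start, length) output are assumptions chosen to match the description in the question.

```python
def valid_sections(gaps, limit=7):
    """List (start, length) for every run of consecutive gaps whose mean is <= limit.

    Positions are 1-based, as in the question.
    """
    result = []
    for start in range(len(gaps)):
        total = 0
        for length, value in enumerate(gaps[start:], start=1):
            total += value
            # mean <= limit  is the same as  total <= limit * length
            if total <= limit * length:
                result.append((start + 1, length))
    return result
```

For `[1, 11, 1, 8, 12]` the result includes `(1, 5)` for the whole sequence and excludes `(2, 4)`.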
Let's say I want to generate all integers from 1-1000 in a random order. But...
No numbers are generated more than once
Without storing an Array, List... of all possible numbers
Without storing the already generated numbers.
Without missing any numbers in the end.
I think that should be impossible but maybe I'm just not thinking about the right solution.
I would like to use it in C# but I'm more interested in the approach than the actual implementation.
Encryption. An encryption is a one-to-one mapping between two sets. If the two sets are the same, then it is a permutation specified by the encryption key. Write/find an encryption that maps the range 0..1000 onto itself. Read up on Format Preserving Encryption (FPE) to help you here.
To generate the random order just encrypt the numbers 0, 1, 2, ... in order. You don't need to store them, just keep track of how far you have got through the list.
From a practical point of view, numbers in the range 0..1023 would be easier to deal with, as that would be a block cipher with a 10-bit block size, and you could write a simple Feistel cipher to generate your numbers. You might want to do that anyway, and just re-encrypt numbers above 1000 -- the cycle walking method of FPE.
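A minimal Python sketch of that idea, for illustration only: a toy 10-bit Feistel network with cycle walking. The round function (one byte of a SHA-256 digest), the round count, the key format and all function names are assumptions made up for this example, and the construction is not meant to be cryptographically strong. It permutes the indices 0..999 and adds 1 to produce the numbers 1..1000 asked for in the question.

```python
import hashlib


def feistel_10bit(value, key, rounds=4):
    """Toy Feistel network: a key-dependent permutation of 0..1023."""
    left, right = value >> 5, value & 0x1F  # two 5-bit halves
    for r in range(rounds):
        digest = hashlib.sha256(f"{key}:{r}:{right}".encode()).digest()
        left, right = right, left ^ (digest[0] & 0x1F)
    return (left << 5) | right


def permute(index, key, size=1000):
    """Cycle walking: re-encrypt until the value falls inside 0..size-1."""
    value = feistel_10bit(index, key)
    while value >= size:
        value = feistel_10bit(value, key)
    return value


def shuffled_range(key, size=1000):
    """Yield 1..size in a key-dependent order, storing only a counter."""
    for i in range(size):
        yield permute(i, key, size) + 1
```

Only the loop counter is kept between outputs; a different key gives a different order.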
If randomness isn't a major concern, you could use a linear congruential generator. Since an LCG won't produce a maximal-length sequence when the modulus is a prime number, you would need to choose a larger modulus (the next highest power of 2 would be an obvious choice) and skip any values outside the required range.
I'm afraid C# isn't really my thing, but hopefully the following Python is self-explanatory. It will need a bit of tweaking if you want to generate sequences over very small ranges:
# randint(a, b) returns a random integer in the range (a..b) (inclusive)
from random import randint

def lcg_params(u, v):
    # Generate parameters for an LCG that produces a maximal length sequence
    # of numbers in the range (u..v)
    diff = v - u
    if diff < 4:
        raise ValueError("Sorry, range must be at least 4.")
    m = 2 ** diff.bit_length()              # Modulus
    a = (randint(1, (m >> 2) - 1) * 4) + 1  # Random odd integer, (a-1) divisible by 4
    c = randint(3, m) | 1                   # Any odd integer will do
    return (m, a, c, u, diff + 1)

def generate_pseudorandom_sequence(rmin, rmax):
    (m, a, c, offset, seqlength) = lcg_params(rmin, rmax)
    x = 1         # Start with a seed value of 1
    result = []   # Create empty list for output values
    for i in range(seqlength):
        # To generate numbers on the fly without storing them in an array,
        # just run the following while loop to fetch a new number
        while True:
            x = (x * a + c) % m       # Iterate LCG until we get a value in the
            if x < seqlength: break   # required range
        result.append(x + offset)     # Add this value to the list
    return result
Example:
>>> generate_pseudorandom_sequence(1, 20)
[4, 6, 8, 1, 10, 3, 12, 5, 14, 7, 16, 9, 18, 11, 20, 13, 15, 17, 19, 2]
Is it possible without using exponentiation to have a set of numbers that when added together, always give unique sum?
I know it can be done with exponentiation (see first answer): The right way to manage user privileges (user hierarchy)
But I'm wondering if it's possible without exponentiation.
No, you can only use exponentiation, because the sum of lower values have to be less than the new number to be unique: 1+2=3 < 4, 1+2+4=7 < 8.
[EDIT:]
This is a layman's explanation; of course there are other possibilities, but none as efficient as using powers of 2.
There's a chance it can be done without exponentiation (I'm no math expert), but not in any way that's more efficient than exponentiation. This is because it only takes one bit of storage space per possible value, and as an added plus you can use boolean operators to do useful stuff with the values.
If you restrict yourself to integers, the numbers have to grow at least as fast as an exponential function. If you find a function that grows faster (like, oh, maybe the Ackermann function) then the numbers produced by that will probably work too.
With floating-point numbers, you can keep adding unique irreducible roots of primes (sqrt(2), sqrt(3), sqrt(5), ...) and you will always get something unique, up until you hit the limits of floating-point precision. Not sure how many unique numbers you could squeeze out of it - maybe you should try it.
No. To see this directly, think about building up the set of basis values by considering at each step the smallest possible positive integer that could be included as the next value. The next number to add must be different from all possible sums of the numbers already in the set (including the empty sum, which is 0), and can't combine with any combination of numbers already present to produce a duplicate. So...
{} : all possible sums = {0}, smallest possible next = 1
{1} : all possible sums = {0, 1}, smallest possible next = 2
{1, 2} : all possible sums = {0, 1, 2, 3}, smallest possible next = 4
{1, 2, 4} : a.p.s. = {0, 1, 2, 3, 4, 5, 6, 7}, s.p.n. = 8
{1, 2, 4, 8} ...
And, of course, we're building up the binary powers. You could start with something other than {1, 2}, but look what happens, using the "smallest possible next" rule:
{1, 3} : a.p.s. = {0, 1, 3, 4}, s.p.n. = 5 (2 is ruled out because it could be added to 1 giving 3, which is already there)
{1, 3, 5} : a.p.s. = {0, 1, 3, 4, 5, 6, 8, 9}, s.p.n. = 10
{1, 3, 5, 10} ...
This sequence is growing faster than the binary powers, term by term.
If you want a nice Project-Euler-style programming challenge, you could write a routine that takes a set of positive integers and determines the "smallest possible next" positive integer, under the "sums must be unique" constraint.
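A brute-force Python sketch of such a routine follows. The names `subset_sums` and `smallest_possible_next` are made up here. It enumerates all 2^n subset sums, so it is only practical for small sets. A candidate x is acceptable exactly when it is not the difference of two existing subset sums, because otherwise x plus the smaller sum would duplicate the larger one.

```python
def subset_sums(values):
    """Return the sums of all subsets of values, including the empty sum 0."""
    sums = [0]
    for v in values:
        sums += [s + v for s in sums]
    return sums


def smallest_possible_next(values):
    """Smallest positive integer that keeps all subset sums unique."""
    sums = subset_sums(values)
    if len(set(sums)) != len(sums):
        raise ValueError("the given values already produce duplicate sums")
    # x collides if x + s == t for some existing sums s < t, i.e. x == t - s.
    differences = {t - s for s in sums for t in sums if t > s}
    candidate = 1
    while candidate in differences:
        candidate += 1
    return candidate
```

For example, `smallest_possible_next([1, 2, 4])` returns 8, the next binary power.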
EDIT: Wow, many great responses. Yes, I am using this as a fitness function for judging the quality of a sort performed by a genetic algorithm. So cost-of-evaluation is important (i.e., it has to be fast, preferably O(n).)
As part of an AI application I am toying with, I'd like to be able to rate a candidate array of integers based on its monotonicity, aka its "sortedness". At the moment, I'm using a heuristic that calculates the longest sorted run, and then divides that by the length of the array:
public double monotonicity(int[] array) {
    if (array.length == 0) return 1d;
    int longestRun = longestSortedRun(array);
    return (double) longestRun / (double) array.length;
}

public int longestSortedRun(int[] array) {
    if (array.length == 0) return 0;
    int longestRun = 1;
    int currentRun = 1;
    for (int i = 1; i < array.length; i++) {
        if (array[i] >= array[i - 1]) {
            currentRun++;
        } else {
            currentRun = 1;
        }
        if (currentRun > longestRun) longestRun = currentRun;
    }
    return longestRun;
}
This is a good start, but it fails to take into account the possibility that there may be "clumps" of sorted sub-sequences. E.g.:
{ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9}
This array is partitioned into three sorted sub-sequences. My algorithm will rate it as only 40% sorted, but intuitively, it should get a higher score than that. Is there a standard algorithm for this sort of thing?
This seems like a good candidate for Damerau–Levenshtein distance - the number of swaps needed to sort the array. This should be proportional to how far each item is from where it should be in a sorted array.
Here's a simple ruby algorithm that sums the squares of the distances. It seems a good measure of sortedness - the result gets smaller every time two out-of-order elements are swapped.
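ap = a.sort
sum = 0
a.each_index { |i|
  j = ap.index(a[i]) - i   # distance from the element's sorted position
  sum += (j * j)
}
# to_f avoids integer division, which would truncate the result
dist = sum.to_f / (a.size * a.size)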
I expect that the choice of function to use depends very strongly on what you intend to use it for. Based on your question, I would guess that you are using a genetic system to create a sorting program, and this is to be the ranking function. If that is the case, then speed of execution is crucial. Based on that, I bet your longest-sorted-subsequence algorithm would work pretty well. That sounds like it should define fitness pretty well.
Something like these? http://en.wikipedia.org/wiki/Rank_correlation
Here's one I just made up.
For each pair of adjacent values, calculate the numeric difference between them. If the second is greater than or equal to the first, add that to the sorted total, otherwise add to the unsorted total. When done, take the ratio of the two.
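A possible Python rendering of this idea. The answer leaves "the ratio of the two" open, so this sketch assumes the score is the sorted total divided by the sum of both totals, which keeps it between 0 and 1; the function name is also made up.

```python
def adjacent_difference_score(values):
    """Share of the total adjacent movement that goes in the sorted direction."""
    sorted_total = 0
    unsorted_total = 0
    for prev, cur in zip(values, values[1:]):
        diff = cur - prev
        if diff >= 0:
            sorted_total += diff
        else:
            unsorted_total -= diff  # accumulate the magnitude of the drop
    total = sorted_total + unsorted_total
    # No movement at all (empty, single element, all equal) counts as sorted.
    return 1.0 if total == 0 else sorted_total / total
```

This is a single O(n) pass, which fits the fitness-function use case mentioned in the question.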
Compute the lengths of all sorted sub-sequences, then square them and add them.
If you want to calibrate how much emphasis you put on the largest ones, use a power other than 2.
I'm not sure what the best way to normalize this by length is; maybe divide it by the length squared?
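Sketched in Python, under the assumption that "sorted sub-sequences" means maximal contiguous non-decreasing runs and that the score is normalized by the array length raised to the same power, as tentatively suggested. The function names are illustrative.

```python
def run_lengths(values):
    """Lengths of the maximal contiguous non-decreasing runs."""
    if not values:
        return []
    runs = [1]
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            runs[-1] += 1
        else:
            runs.append(1)
    return runs


def squared_run_score(values, power=2):
    """Sum of run lengths raised to `power`, divided by len(values) ** power."""
    n = len(values)
    if n == 0:
        return 1.0
    return sum(r ** power for r in run_lengths(values)) / n ** power
```

A fully sorted array scores 1.0, and an array broken into many short runs scores close to 0.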
What you're probably looking for is Kendall Tau. It's a one-to-one function of the bubble sort distance between two arrays. To test whether an array is "almost sorted", compute its Kendall Tau against a sorted array.
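As a sketch, Kendall's tau against the sorted order can be derived from the number of inversions (pairs that appear in the wrong order). The version below assumes all values are distinct and uses a simple O(n^2) pair count for clarity; a merge-sort-based count would bring that down to O(n log n). The function names are made up for this example.

```python
from itertools import combinations


def count_inversions(values):
    """Number of pairs (earlier, later) with earlier > later."""
    return sum(1 for x, y in combinations(values, 2) if x > y)


def kendall_tau_vs_sorted(values):
    """Kendall tau between the array and its sorted order (no ties assumed)."""
    n = len(values)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 1.0
    # tau = (concordant - discordant) / pairs, and discordant == inversions
    return 1.0 - 2.0 * count_inversions(values) / pairs
```

The result ranges from 1.0 for a sorted array down to -1.0 for a reversed one.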
I would suggest looking at the Pancake Problem and the reversal distance of permutations. These algorithms are often used to find the distance between two permutations (the identity and the permuted string). This distance measure should take into account more clumps of in-order values, as well as reversals (monotonically decreasing instead of increasing subsequences). There are also approximations that run in polynomial time.
It really all depends on what the number means and if this distance function makes sense in your context though.
I have the same problem (monotonicity scoring), and I suggest trying the Longest Increasing Subsequence. The most efficient algorithm runs in O(n log n), which is not so bad.
Taking the example from the question, the longest increasing subsequence of {4, 5, 6, 0, 1, 2, 3, 7, 8, 9} is {0, 1, 2, 3, 7, 8, 9} (length 7). It may rate the array better (70%) than your longest-sorted-run algorithm.
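A compact O(n log n) sketch in Python, using the standard "tails" technique with binary search. It treats equal neighbours as sorted (a non-decreasing subsequence), which is an assumption made to match the `>=` comparison in the question's code; the function name is illustrative.

```python
from bisect import bisect_right


def longest_nondecreasing_subsequence_length(values):
    """Length of the longest non-decreasing subsequence, in O(n log n)."""
    tails = []  # tails[k] = smallest possible last value of a subsequence of length k + 1
    for v in values:
        pos = bisect_right(tails, v)
        if pos == len(tails):
            tails.append(v)   # v extends the longest subsequence found so far
        else:
            tails[pos] = v    # v gives a smaller tail for length pos + 1
    return len(tails)
```

Dividing the result by the array length gives a score between 0 and 1.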
It highly depends on what you're intending to use the measure for, but one easy way to do this is to feed the array into a standard sorting algorithm and measure how many operations (swaps and/or comparisons) need to be done to sort the array.
Some experiments with a modified Ratcliff & Obershelp algorithm (Python's difflib):
>>> from difflib import SequenceMatcher as sm
>>> a = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> c = [ 0, 1, 9, 2, 8, 3, 6, 4, 7, 5 ]
>>> b = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> b.sort()
>>> s = sm(None, a, b)
>>> s.ratio()
0.69999999999999996
>>> s2 = sm(None, c, b)
>>> s2.ratio()
0.29999999999999999
So kind of does what it needs to. Not too sure how to prove it though.
How about counting the number of steps with increasing value vs. the total number of steps? That's O(n).
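In Python this might look like the sketch below. Counting equal neighbours as "increasing" is an assumption, made to match the `>=` comparison in the question's code, and the function name is made up.

```python
def increasing_step_ratio(values):
    """Fraction of adjacent steps that do not decrease."""
    steps = len(values) - 1
    if steps <= 0:
        return 1.0  # zero or one element: trivially sorted
    increasing = sum(1 for prev, cur in zip(values, values[1:]) if cur >= prev)
    return increasing / steps
```

For the example array from the question, 8 of the 9 steps are non-decreasing.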