There is a method contains that can be used to check if a particular element exists in a Vec. How to check if all elements from a Vec are contained in another Vec? Is there something more concise than iterating manually and checking all elements explicitly?
You have two main choices:
naively check each element from one vector to see if it's in the other. This has time complexity O(n^2) but it's also very simple and has low overhead:
assert!(b.iter().all(|item| a.contains(item)));
create a set of all the elements of one of the vectors, and then check whether the elements of the other are contained in it. This has O(n) time complexity, but higher overhead, including an extra heap allocation:
use std::collections::HashSet;

let a_set: HashSet<_> = a.iter().copied().collect();
assert!(b.iter().all(|item| a_set.contains(item)));
Which one is "better" will depend on your requirements. If you only care about speed, the better choice will still depend on the number of elements in your vectors, so you should test both with realistic data. You could also test with a BTreeSet, which has different performance characteristics from HashSet.
Here are some rough benchmarks (source) for how the implementations vary with the size of the input. In all tests, b is half the size of a and contains a random subset of a's elements:
Size of a    Vec::contains    HashSet::contains    BTreeSet::contains
10           14               386                  327
100          1,754            3,187                5,371
1000         112,306          31,233               88,340
10000        2,821,867        254,801              728,268
100000       29,207,999       2,645,703            6,611,666
Times in nanoseconds.
The naive O(n^2) solution is fastest when the number of elements is small. The overhead of allocating a HashSet or BTreeSet is overshadowed by the impact of the number of comparisons when the size is more than about 200. BTreeSet is mostly a lot slower than HashSet, but is slightly faster when the number of elements is very small.
If you have sorted vectors, you can do the search in linear time:
let mut vec = vec![0, 2, 4, 3, 6, 3, 5, 1, 0];
let mut v = vec![1, 4, 3, 3, 1];
vec.sort_unstable();
v.sort_unstable();
// Remove duplicate elements in v
v.dedup();
let mut vec_iter = vec.iter();
assert!(v.iter().all(|&x| vec_iter.any(|&item| item == x)));
Reference: C++ has std::includes which does exactly that.
You could also sort the vectors and then test them for equality:
fn main() {
let mut v1 = vec![2, 3, 1];
let mut v2 = vec![3, 1, 2];
v1.sort();
v2.sort();
assert_eq!(v1, v2);
}
I had an interview yesterday and was asked for a method to find all of the pairs of numbers from a list that add up to an integer given separately from the list. The list can be arbitrarily long, but for example:
numbers = [11,1,5,27,7,18,2,4,8]
sum = 9
pairs = [(1,8),(5,4),(7,2)]
I got as far as sorting the list and eliminating all numbers greater than the sum number and then doing two nested for loops to take each index and iterate through the other numbers to check whether they sum up to the given number, but was told that there was a more efficient way of doing it...
I've been trying to figure it out but have nothing, other than doing the nested iteration backwards but that only seems marginally more efficient.
Any ideas?
This can be done in O(n) time and O(n) auxiliary space; testing for membership of a set takes O(1) time. Since the output also takes up to O(n) space, the auxiliary space should not be a significant issue.
def pairs_sum(numbers, k):
    numbers_set = set(numbers)
    return [(x, y) for x in numbers if (y := k - x) in numbers_set and x < y]
Example:
>>> pairs_sum([11, 1, 5, 27, 7, 18, 2, 4, 8], 9)
[(1, 8), (2, 7), (4, 5)]
This is kind of a classic problem, and I'm not sure Stack Overflow is the right place to ask this kind of question.
Sort the list in ascending order.
Use two indices: i1 starting from the end of the list and moving down, and i2 starting from the beginning and moving up.
Loop:

while i1 > i2
    if (list[i1] + list[i2] == target)
        store (list[i1], list[i2]) in result pairs
        i1--
        i2++
    else if (list[i1] + list[i2] > target)
        i1--
    else if (list[i1] + list[i2] < target)
        i2++
This runs in O(n), with n the length of the list, on top of the sort, which can be done with quicksort in O(n log n) on average.
Note: this algorithm doesn't handle the case where the input list contains the same number several times.
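For concreteness, here is a minimal Python sketch of that two-index walk (the function name pairs_sum_sorted is mine, and as noted it ignores duplicates):

def pairs_sum_sorted(numbers, target):
    # Sort the list in ascending order: O(n log n)
    nums = sorted(numbers)
    i2, i1 = 0, len(nums) - 1
    pairs = []
    # Move the two indices toward each other: O(n)
    while i2 < i1:
        s = nums[i2] + nums[i1]
        if s == target:
            pairs.append((nums[i2], nums[i1]))
            i2 += 1
            i1 -= 1
        elif s > target:
            i1 -= 1
        else:
            i2 += 1
    return pairs

# pairs_sum_sorted([11, 1, 5, 27, 7, 18, 2, 4, 8], 9) -> [(1, 8), (2, 7), (4, 5)]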
Take a UUID in its hex representation: '123e4567-e89b-12d3-a456-426655440000'
I have a lot of such UUIDs, and I want to separate them into N buckets, where N is of my choosing, and I want to generate the bounds of these buckets.
I can trivially create 16 buckets with these bounds:
00000000-0000-0000-0000-000000000000
10000000-0000-0000-0000-000000000000
20000000-0000-0000-0000-000000000000
30000000-0000-0000-0000-000000000000
...
e0000000-0000-0000-0000-000000000000
f0000000-0000-0000-0000-000000000000
ffffffff-ffff-ffff-ffff-ffffffffffff
just by iterating over the options for the first hex digit.
Suppose I want 50 equal-size buckets (equal in terms of the number of UUID possibilities contained within each bucket), or 2000 buckets, or N buckets.
How do I generate such bounds as a function of N?
Your UUIDs above are 32 hex digits in length. So that means you have 16^32 ≈ 3.4e38 possible UUIDs. A simple solution would be to use a big int library (or a method of your own) to store these very large values as actual numbers. Then, you can just divide the number of possible UUIDs by N (call that value k), giving you bucket bounds of 0, k, 2*k, ... (N-1)*k, UMAX.
This runs into a problem if N doesn't divide the number of possible UUIDs. Obviously, not every bucket will have the same number of UUIDs, but in this case, they won't even be evenly distributed. For example, if the number of possible UUIDs is 32, and you want 7 buckets, then k would be 4, so you would have buckets of size 4, 4, 4, 4, 4, 4, and 8. This probably isn't ideal. To fix this, you could instead make the bucket bounds at 0, (1*UMAX)/N, (2*UMAX)/N, ... ((N-1)*UMAX)/N, UMAX. Then, in the inconvenient case above, you would end up with bounds at 0, 4, 9, 13, 18, 22, 27, 32 -- giving bucket sizes of 4, 5, 4, 5, 4, 5, 5.
You will probably need a big int library or some other method to store large integers in order to use this method. For comparison, a long long in C++ (in some implementations) can only store up to 2^64 ≈ 1.8e19.
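Since Python integers are arbitrary precision, a rough sketch of this bounds calculation needs no big-int library (function names are mine):

umax = (1 << 128) - 1  # ffffffff-ffff-ffff-ffff-ffffffffffff

def bucket_bounds(n):
    # Bounds at 0, (1*UMAX)//N, (2*UMAX)//N, ..., UMAX, so bucket sizes
    # differ by at most one UUID value.
    return [i * umax // n for i in range(n)] + [umax]

def to_uuid(value):
    # Format a 128-bit integer in the canonical 8-4-4-4-12 hex layout.
    h = format(value, "032x")
    return "-".join([h[:8], h[8:12], h[12:16], h[16:20], h[20:]])

for bound in bucket_bounds(50)[:3]:
    print(to_uuid(bound))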
If N is a power of 2, then the solution is obvious: you can split on bit boundaries as for 16 buckets in your question.
If N is not a power of 2, the buckets mathematically cannot be of exactly equal size, so the question becomes how much inequality you are willing to tolerate in the name of efficiency.
As long as N<2^24 or so, the simplest thing to do is just allocate UUIDs based on the first 32 bits into N buckets each of size 2^32/N. That should be fast enough and equal enough for most applications, and if N needs to be larger than that allows, you could easily double the bits with a smallish penalty.
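A quick Python sketch of that first-32-bits allocation (my own phrasing of it; the shift maps a 32-bit prefix onto N roughly equal buckets):

def bucket_index(uuid_str, n):
    # Take the first 32 bits (8 hex digits) of the UUID...
    prefix = int(uuid_str.replace("-", "")[:8], 16)
    # ...and scale them onto 0..n-1.
    return (prefix * n) >> 32

print(bucket_index("123e4567-e89b-12d3-a456-426655440000", 50))  # 3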
Is there any kind of object class for piecewise / noncontiguous ranges in Julia? For instance, I can create a regular range:
a = UnitRange(1:5)
But, if I wanted to combine this with other ranges:
b = UnitRange([1:5, 8:10, 4:7])
I cannot currently find an object or method. There is a PiecewiseIncreasingRanges module (https://github.com/simonster/PiecewiseIncreasingRanges.jl) that would be just what I want in this situation, except that it, as the name implies, requires the ranges be monotonically increasing.
The context for this is that I am looking for a way to create a compressed, memory-efficient version of the SparseMatrixCSC type for sparse matrices with repeating rows. The RLEVectors module will work well to save space on the nonzero-value vector in the sparse matrix class. Now, though, I am trying to find something to save space for the row-value vector that also defines the sparse matrix, since series of repeating rows will result in ranges of values in that vector (e.g. if the first 10 rows, or even certain columns in the first ten rows, of a sparse matrix are identical, then there will be a lot of 1:10 patterns in the row-value vector).
More generally, I'd like a range such as the b object that I try to create above over which I could do an iterated loop, getting:
for (idx, item) in enumerate(hypothetical_object)
println("idx: $idx, item: $item")
end
idx: 1, item: 1
idx: 2, item: 2
...
idx: 5, item: 5
idx: 6, item: 8
idx: 7, item: 9
idx: 8, item: 10
idx: 9, item: 4
idx: 10, item: 5
...
Update: One thing I'm considering, and will probably try implementing if I don't hear other suggestions here, will be to just create an array of PiecewiseIncreasingRange objects, one for each column in my sparse matrix. (I would probably also then break the nonzero value vector into an array of separate pieces, one for each column of my sparse matrix as well). This would at least be relatively simple to implement. I don't have a good sense off the bat how this would compare in terms of computational efficiency to the kind of object I am searching for in this question. I suspect that memory requirements would be about the same.
To loop over a sequence of ranges (or other iterators), you can use the chain function in the Iterators.jl package.
For example:
using Iterators

b = chain(1:5, 8:10, 4:7)
for i in b
    println(i)
end
outputs the elements of each range.
I was asked to use dynamic programming to solve a problem. My notes on what constitutes dynamic programming are mixed. I believe it requires a "bottom-up" approach, where the smallest problems are solved first.
One thing I have contradictory information on is whether something can still be dynamic programming if the same subproblems are solved more than once, as is often the case in recursion.
For instance, for Fibonacci I can have a recursive algorithm:
RecursiveFibonacci(n)
    if (n = 1 or n = 2)
        return 1
    else
        return RecursiveFibonacci(n-1) + RecursiveFibonacci(n-2)
In this situation, the same sub-problems may be solved over and over again. Does this mean it is not dynamic programming? That is, if I wanted dynamic programming, would I have to avoid re-solving subproblems, for example by using an array of length n and storing the solution to each subproblem (the first entries of the array would be 1, 1, 2, 3, 5, 8, 13, 21)?
Fibonacci(n)
    F[1] = 1
    F[2] = 1
    for i = 3 to n
        F[i] = F[i-1] + F[i-2]
    return F[n]
Dynamic programs can usually be succinctly described with recursive formulas.
But if you implement them with simple recursive computer programs, these are often inefficient for exactly the reason you raise: the same computation is repeated. Fibonacci is an example of repeated computation, though it is not a dynamic program.
There are two approaches to avoiding the repetition.
Memoization. The idea here is to cache the answer computed for each set of arguments to the recursive function and return the cached value when it exists.
Bottom-up table. Here you "unwind" the recursion so that results at levels less than i are combined to the result at level i. This is usually depicted as filling in a table, where the levels are rows.
One of these methods is implied for any DP algorithm. If computations are repeated, the algorithm isn't a DP. So the answer to your question is "yes."
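For instance, here is a minimal Python sketch of both approaches applied to the Fibonacci function from the question (a plain dict stands in for the memo cache):

_fib_cache = {1: 1, 2: 1}

def fib_memo(n):
    # Memoized RecursiveFibonacci: each value of n is computed only once.
    if n not in _fib_cache:
        _fib_cache[n] = fib_memo(n - 1) + fib_memo(n - 2)
    return _fib_cache[n]

def fib_table(n):
    # Bottom-up table: results at smaller indices are combined into larger ones.
    table = [0, 1, 1] + [0] * max(0, n - 2)
    for i in range(3, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

assert fib_memo(8) == fib_table(8) == 21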
So an example... Let's try the problem of making change of c cents given you have coins with values v_1, v_2, ... v_n, using a minimum number of coins.
Let N(c) be the minimum number of coins needed to make c cents. Then one recursive formulation is
N(c) = 1 + min_{i = 1..n} N(c - v_i)
The base cases are N(0)=0 and N(k)=inf for k<0.
To memoize this requires just a hash table mapping c to N(c).
In this case the "table" has only one dimension, which is easy to fill in. Say we have coins with values 1, 3, 5, then the N table starts with
N(0) = 0, the initial condition.
N(1) = 1 + min(N(1-1), N(1-3), N(1-5)) = 1 + min(0, inf, inf) = 1
N(2) = 1 + min(N(2-1), N(2-3), N(2-5)) = 1 + min(1, inf, inf) = 2
N(3) = 1 + min(N(3-1), N(3-3), N(3-5)) = 1 + min(2, 0, inf) = 1
You get the idea. You can always compute N(c) from N(d), d < c in this manner.
In this case, you need only remember the last 5 values because that's the biggest coin value. Most DPs are similar. Only a few rows of the table are needed to get the next one.
The table is k-dimensional for k independent variables in the recursive expression.
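Here is a rough Python sketch of both versions of the coin-change recurrence above (functools.lru_cache handles the memo table; float("inf") stands in for inf):

from functools import lru_cache

coins = [1, 3, 5]  # the example coin values

@lru_cache(maxsize=None)
def n_memo(c):
    # Top-down: N(c) = 1 + min over i of N(c - v_i), with N(0) = 0 and N(k) = inf for k < 0.
    if c == 0:
        return 0
    if c < 0:
        return float("inf")
    return 1 + min(n_memo(c - v) for v in coins)

def n_table(c):
    # Bottom-up: fill N(0), N(1), ..., N(c) in order.
    table = [0] + [float("inf")] * c
    for amount in range(1, c + 1):
        for v in coins:
            if amount >= v:
                table[amount] = min(table[amount], 1 + table[amount - v])
    return table[c]

assert n_memo(7) == n_table(7) == 3  # e.g. 5 + 1 + 1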
We think of a dynamic programming approach to a problem if it has
overlapping subproblems
optimal substructure
In very simple terms, dynamic programming has two faces: the top-down and the bottom-up approach.
In your case, the recursive solution you describe is the top-down approach.
In the top-down approach, we write a recursive (brute-force) solution and memoize the results, so that when the same subproblem comes up again we reuse the stored result; it is brute force plus memoization. The brute-force part comes from a simple recursive relation.
EDIT: Wow, many great responses. Yes, I am using this as a fitness function for judging the quality of a sort performed by a genetic algorithm. So cost-of-evaluation is important (i.e., it has to be fast, preferably O(n).)
As part of an AI application I am toying with, I'd like to be able to rate a candidate array of integers based on its monotonicity, aka its "sortedness". At the moment, I'm using a heuristic that calculates the longest sorted run, and then divides that by the length of the array:
public double monotonicity(int[] array) {
    if (array.length == 0) return 1d;
    int longestRun = longestSortedRun(array);
    return (double) longestRun / (double) array.length;
}

public int longestSortedRun(int[] array) {
    if (array.length == 0) return 0;
    int longestRun = 1;
    int currentRun = 1;
    for (int i = 1; i < array.length; i++) {
        if (array[i] >= array[i - 1]) {
            currentRun++;
        } else {
            currentRun = 1;
        }
        if (currentRun > longestRun) longestRun = currentRun;
    }
    return longestRun;
}
This is a good start, but it fails to take into account the possibility that there may be "clumps" of sorted sub-sequences. E.g.:
{ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9}
This array is partitioned into three sorted sub-sequences. My algorithm will rate it as only 40% sorted, but intuitively, it should get a higher score than that. Is there a standard algorithm for this sort of thing?
This seems like a good candidate for Damerau–Levenshtein distance: the number of swaps needed to sort the array. This should be proportional to how far each item is from where it should be in a sorted array.
Here's a simple Ruby algorithm that sums the squares of the distances. It seems a good measure of sortedness: the result gets smaller every time two out-of-order elements are swapped.

ap = a.sort
sum = 0
a.each_index { |i|
  j = ap.index(a[i]) - i
  sum += j * j
}
# Use a float so the normalization isn't truncated by integer division
dist = sum.to_f / (a.size * a.size)
I expect that the choice of function to use depends very strongly on what you intend to use it for. Based on your question, I would guess that you are using a genetic system to create a sorting program, and this is to be the ranking function. If that is the case, then speed of execution is crucial. Based on that, I bet your longest-sorted-subsequence algorithm would work pretty well. That sounds like it should define fitness pretty well.
Something like these? http://en.wikipedia.org/wiki/Rank_correlation
Here's one I just made up.
For each pair of adjacent values, calculate the numeric difference between them. If the second is greater than or equal to the first, add that to the sorted total, otherwise add to the unsorted total. When done, take the ratio of the two.
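A minimal Python sketch of that idea, under my reading of it (the answer leaves the exact normalization open; here I return the ascending total over the combined total so a fully sorted array scores 1.0):

def step_ratio(array):
    sorted_total = 0
    unsorted_total = 0
    for prev, cur in zip(array, array[1:]):
        diff = cur - prev
        if diff >= 0:
            sorted_total += diff       # ascending (or equal) step
        else:
            unsorted_total += -diff    # descending step
    total = sorted_total + unsorted_total
    return 1.0 if total == 0 else sorted_total / total

print(step_ratio([4, 5, 6, 0, 1, 2, 3, 7, 8, 9]))  # about 0.65 under this normalization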
Compute the lengths of all sorted sub-sequences, then square them and add them up.
If you want to calibrate how much emphasis you put on the largest ones, use a power different from 2.
I'm not sure what's the best way to normalize this by length; maybe divide it by the length squared?
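A small Python sketch of that scoring (the naming and the length-squared normalization, which the answer only floats as a possibility, are mine):

def squared_run_score(array, power=2):
    if not array:
        return 1.0
    # Lengths of the maximal non-decreasing runs
    runs = []
    run = 1
    for prev, cur in zip(array, array[1:]):
        if cur >= prev:
            run += 1
        else:
            runs.append(run)
            run = 1
    runs.append(run)
    # Raise each length to `power`, add them, normalize by len(array) ** power
    return sum(r ** power for r in runs) / len(array) ** power

print(squared_run_score([4, 5, 6, 0, 1, 2, 3, 7, 8, 9]))  # (3**2 + 7**2) / 10**2 = 0.58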
What you're probably looking for is Kendall Tau. It's a one-to-one function of the bubble sort distance between two arrays. To test whether an array is "almost sorted", compute its Kendall Tau against a sorted array.
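A simple Python sketch (the naive pair count below is O(n^2); a merge-sort style inversion count gets this down to O(n log n) if speed matters):

from itertools import combinations

def kendall_tau_distance(array):
    # Bubble-sort distance: the number of pairs (earlier, later) that are out of order.
    return sum(1 for x, y in combinations(array, 2) if x > y)

def kendall_sortedness(array):
    # Normalize to [0, 1]: 1.0 for a sorted array, 0.0 for a reversed one.
    n = len(array)
    max_pairs = n * (n - 1) // 2
    return 1.0 if max_pairs == 0 else 1.0 - kendall_tau_distance(array) / max_pairs

print(kendall_sortedness([4, 5, 6, 0, 1, 2, 3, 7, 8, 9]))  # about 0.73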
I would suggest looking at the Pancake Problem and the reversal distance of the permutations. These algorithms are often used to find the distance between two permutations (the Identity and the permuted string). This distance measure should take into account more clumps of in order values, as well as reversals (monotonically decreasing instead of increasing subsequences). There are also approximations that are polynomial time[PDF].
It really all depends on what the number means and if this distance function makes sense in your context though.
I have the same problem (monotonicity scoring), and I suggest you try the Longest Increasing Subsequence. The most efficient algorithm runs in O(n log n), which is not bad.
Taking the example from the question, the longest increasing subsequence of {4, 5, 6, 0, 1, 2, 3, 7, 8, 9} is {0, 1, 2, 3, 7, 8, 9} (length 7). Maybe it rates better (70%) than your longest-sorted-run algorithm.
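A Python sketch of the O(n log n) approach (using bisect_right so that a non-decreasing run counts as "increasing", matching the >= in the question's code):

from bisect import bisect_right

def lis_length(array):
    # Patience-sorting trick: tails[k] holds the smallest possible last element
    # of a non-decreasing subsequence of length k + 1.
    tails = []
    for x in array:
        pos = bisect_right(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)

def lis_monotonicity(array):
    return 1.0 if not array else lis_length(array) / len(array)

print(lis_monotonicity([4, 5, 6, 0, 1, 2, 3, 7, 8, 9]))  # 0.7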
It highly depends on what you're intending to use the measure for, but one easy way to do this is to feed the array into a standard sorting algorithm and measure how many operations (swaps and/or comparisons) need to be done to sort the array.
Some experiments with a modified Ratcliff & Obershelp:
>>> from difflib import SequenceMatcher as sm
>>> a = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> c = [ 0, 1, 9, 2, 8, 3, 6, 4, 7, 5 ]
>>> b = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> b.sort()
>>> s = sm(None, a, b)
>>> s.ratio()
0.69999999999999996
>>> s2 = sm(None, c, b)
>>> s2.ratio()
0.29999999999999999
So it kind of does what it needs to. I'm not too sure how to prove it, though.
How about counting the number of steps with an increasing value vs. the total number of steps? That's O(n).
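That would look something like this in Python (a sketch; counting non-decreasing steps to match the earlier examples):

def ascending_step_fraction(array):
    # Fraction of adjacent steps that do not decrease; O(n) time.
    if len(array) < 2:
        return 1.0
    ascending = sum(1 for prev, cur in zip(array, array[1:]) if cur >= prev)
    return ascending / (len(array) - 1)

print(ascending_step_fraction([4, 5, 6, 0, 1, 2, 3, 7, 8, 9]))  # 8/9, about 0.89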