Find a function between two arrays that minimises distance between pairs - r

I will explain my problem in a general setting (as I am interested in a general algorithm), then specialize it to my particular case.
Say we have two finite sets, A and B, both subsets of X and a distance function d that assigns a distance between any two points of X.
What is an algorithm to find two functions, f1 from A to B and f2 from B to A, such that f1(a) is the element of B that is closest to a, and vice versa for f2?
My special case is in R, where I have two sets of points on Earth (lat, lon) and I need to pair them up (from A to B and vice versa) according to their distance.
For reference, I am using the Haversine distance from the geosphere package.
Thanks in advance.

Just mentioning, this is an algorithmic solution for an algorithmic problem.
Let's begin with a solution in O(n^2) time and memory. For each element in A, compute the distance to each element in B. Then iterate over this two-dimensional array and find the minimum of each row: these give the image of f1. The column minima give f2 in the same way. Note that f2 is not in general the inverse of f1, because "closest to" is not a symmetric relation.
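A minimal Python sketch of the quadratic approach (the question is about R, so treat this as an illustration of the algorithm rather than a drop-in solution). The `haversine` helper below is written by hand and is not the geosphere function; it assumes points are `(lat, lon)` pairs in degrees and a mean Earth radius of 6371 km. The names `nearest_maps`, `f1` and `f2` are illustrative.

```python
import math

def haversine(p, q, radius=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(h))

def nearest_maps(A, B, d=haversine):
    """Return (f1, f2) as lists of indices.

    f1[i] is the index in B of the point closest to A[i];
    f2[j] is the index in A of the point closest to B[j].
    """
    # Full |A| x |B| distance table: O(n^2) time and memory.
    dist = [[d(a, b) for b in B] for a in A]
    # Row minima give f1, column minima give f2.
    f1 = [min(range(len(B)), key=row.__getitem__) for row in dist]
    f2 = [min(range(len(A)), key=lambda i: dist[i][j]) for j in range(len(B))]
    return f1, f2

A = [(0.0, 0.0), (10.0, 10.0)]
B = [(1.0, 1.0), (9.0, 9.0), (0.5, 0.5)]
f1, f2 = nearest_maps(A, B)
```

Here `A[0]` maps to `B[2]`, while both `B[0]` and `B[2]` map back to `A[0]`, which shows that f2 is not simply f1 reversed.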
Now we can create a similar solution in O(n log n) time and O(n) memory, using binary search.
Sort the elements of B so that the closest item to any query point can be found in O(log n). With numbers this can be done just by sorting them. With lon and lat, sorting first by lon and then by lat is not enough for an exact answer, because a point that is close in lon can still be far away in lat; there a spatial index such as a k-d tree takes the place of the sorted array.
Now for each element of A, search for the closest item in B using binary search. Each query takes O(log n), so finding the closest item for every element takes O(n log n) in total. Swap the roles of A and B to get f2.
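For the one-dimensional case, the binary search idea can be sketched in Python with the standard `bisect` module. This is only an illustration for plain numbers with `abs(b - x)` as the distance; the function names are made up for the example, and it does not carry over to (lat, lon) points as written.

```python
from bisect import bisect_left

def nearest_sorted(sorted_b, x):
    """Closest value to x in a non-empty sorted list, in O(log n)."""
    i = bisect_left(sorted_b, x)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_b[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda b: abs(b - x))

def nearest_map_1d(A, B):
    """Map every a in A to its closest element of B: O(n log n) overall."""
    sorted_b = sorted(B)
    return {a: nearest_sorted(sorted_b, a) for a in A}

f1 = nearest_map_1d([1, 5, 10], [4, 0, 9, 20])
```

Calling `nearest_map_1d(B, A)` with the arguments swapped gives f2.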

Related

Design an algorithm that minimises the load on the most heavily loaded server

Reading the book by Aziz & Prakash (2021), I am a bit stuck on problem 3.7 and its associated solution, which I am trying to implement.
The problem says:
You have n users with unique hashes h1 through hn and
m servers, numbered 1 to m. User i has Bi bytes to store. You need to
find numbers K1 through Km such that all users with hashes between
Kj and Kj+1 get assigned to server j. Design an algorithm to find the
numbers K1 through Km that minimizes the load on the most heavily
loaded server.
The solution says:
Let L(a,b) be the maximum load on a server when
users with hash h1 through ha are assigned to servers S1 through Sb in
an optimal way so that the max load is minimised. We observe the
following recurrence:
L(a,b) = min over x in [1,a] of max( L(x,b-1), sum of B_i for i = x+1..a )
In other words, we find the right value of x such that if we pack the
first x users in b - 1 servers and the remaining in the last server the max
load on a given server is minimized.
Using this relationship, we can tabulate the values of L till we get
L(n,m). While computing L(a,b) when the values of L is tabulated
for all lower values of a and b we need to find the right value of x to
minimize the load. As we increase x, L(x,b-1) in the above expression increases and the sum term decreases. We can do binary search for x to find the x that minimises their max.
I know that we can probably use some sort of dynamic programming, but how could we implement this idea in code?
The dynamic programming algorithm is defined fairly well given that formula: Implementing a top-down DP algorithm just needs you to loop from x = 1 to a and record which one minimizes that max(L(x,b-1), sum(B_i)) expression.
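A rough top-down sketch of that DP in Python, assuming `B` lists the byte counts in hash order and that a server may be left empty (so x starts at 0 here rather than 1). The function name and the use of `lru_cache` for memoisation are choices made for this example. It tries every x, so it runs in O(n^2 m) time and does not use the binary search the book mentions.

```python
from functools import lru_cache
from itertools import accumulate

def min_max_load_dp(B, m):
    """Smallest possible maximum load when B is cut into at most m contiguous groups."""
    # prefix[k] = B[0] + ... + B[k-1], so a range sum is a difference of two entries.
    prefix = [0] + list(accumulate(B))

    @lru_cache(maxsize=None)
    def L(a, b):
        # One server has to take the first a users all by itself.
        if b == 1:
            return prefix[a]
        # First x users on b - 1 servers, users x+1..a on the last server.
        return min(max(L(x, b - 1), prefix[a] - prefix[x]) for x in range(a + 1))

    return L(len(B), m)

best = min_max_load_dp([7, 2, 5, 10, 8], 2)
```

With two servers the best split is `[7, 2, 5]` and `[10, 8]`, for a maximum load of 18.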
There is, however, a simpler (and faster) greedy/binary search algorithm for this problem that you should consider, which goes like this:
Compute prefix sums for B
Find the minimum value of L such that we can partition B into m contiguous subarrays whose maximum sum is equal to L.
We know 1 <= L <= sum(B). So, perform a binary search to find L, with a helper function canSplit(v) that tests whether we can split B into such subarrays of sum <= v.
canSplit(v) works greedily: Remove as many elements from the start of B as possible so that our sum does not exceed v. Repeat this a total of m times; return True if we've used all of B.
You can use the prefix sums to run canSplit in O(m log n) time, with an additional inner binary search.
Given L, use the same strategy as the canSplit function to determine the m-1 partition points; find the m partition boundaries from there.
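The steps above could look roughly like this in Python. This sketch assumes non-negative integer loads listed in hash order, and it simplifies two details: `can_split` scans B linearly in O(n) instead of using prefix sums with an inner binary search, and the outer search starts from max(B) rather than 1, since no server can hold less than the largest single user. All function names are illustrative.

```python
def can_split(B, m, v):
    """True if B can be cut into at most m contiguous groups, each with sum <= v."""
    used, current = 1, 0
    for b in B:
        if b > v:
            return False          # a single user already exceeds the limit
        if current + b > v:
            used += 1             # start filling the next server
            current = b
        else:
            current += b
    return used <= m

def min_max_load(B, m):
    """Binary search for the smallest feasible maximum load."""
    lo, hi = max(B), sum(B)
    while lo < hi:
        mid = (lo + hi) // 2
        if can_split(B, m, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

def cut_points(B, limit):
    """Indices where a new server starts when filling greedily up to limit."""
    cuts, current = [], 0
    for i, b in enumerate(B):
        if current + b > limit:
            cuts.append(i)
            current = b
        else:
            current += b
    return cuts

B = [7, 2, 5, 10, 8]
limit = min_max_load(B, 2)
cuts = cut_points(B, limit)
```

The hash values at the cut indices then serve as the boundaries K between servers.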

Smallest perfect square divisible by all elements of an array (with large numbers)

Given an array A[] with n elements, the task is to find S mod (10^9+7), in which S is the smallest perfect square which is divisible by all the elements A[i] (1<=i<=n) of the given array.
So, the problem is very easy if the values of A[i] and n are small. But in this case, I don't know what to do when A[i] can be up to 10^7 and n can be up to 10^5. Could anybody help me, please?
The smallest integer X which is a multiple of all the A_i is called the least common multiple of the A_i. It's also true that every common multiple of the A_i is divisible by X. So S is divisible by X, or equivalently S is a multiple of X.
The LCM can be computed fairly efficiently by the algorithms mentioned in the Wikipedia article, but remember our final goal is S, a perfect square, not X. Also, the size of X (and S) is likely to be enormous given the constraints in your problem.
Thus I think the correct approach is to use a modified Sieve of Eratosthenes (or just obtain from some online source a list of primes up to 3163) to completely factor all the A_i simultaneously into their prime power factorizations. Since the A_i <= 10^7 you need only trial-divide by primes <= 10^3.5; any cofactor greater than 1 that remains after dividing these out is itself a prime and belongs in the factorization. Now, with each A_i factored into its prime power factorization, use the prime factorization method to find the LCM (take the maximum power of each prime), but still retain this in prime power format; in other words, don't yet multiply everything together. Next, scan through each of the powers and add 1 to any odd powers. Now you have the prime power factorization of S. Iterate through these prime powers, multiplying each one into the product and taking the product mod (10^9+7) at each step.
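A possible Python sketch of this approach, assuming every A_i is a positive integer. It factors each number by trial division with the primes up to the square root of the largest element, keeps the largest exponent seen for each prime, rounds odd exponents up, and only then multiplies modulo 10^9+7. The helper names are invented for the example, and plain Python may be slow at the upper limits of the constraints.

```python
from math import isqrt

MOD = 10**9 + 7

def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, isqrt(n) + 1):
        if sieve[p]:
            # Cross out every multiple of p starting at p*p.
            sieve[p * p::p] = bytearray(len(range(p * p, n + 1, p)))
    return [p for p in range(2, n + 1) if sieve[p]]

def smallest_square_multiple(A):
    """Smallest perfect square divisible by every element of A, modulo MOD."""
    primes = primes_up_to(isqrt(max(A)))
    max_exp = {}                      # prime -> largest exponent seen (the LCM)
    for a in A:
        for p in primes:
            if p * p > a:
                break
            e = 0
            while a % p == 0:
                a //= p
                e += 1
            if e > max_exp.get(p, 0):
                max_exp[p] = e
        if a > 1:                     # leftover cofactor is a single prime
            max_exp[a] = max(max_exp.get(a, 0), 1)
    result = 1
    for p, e in max_exp.items():
        # Round each odd exponent up to the next even number.
        result = result * pow(p, e + (e % 2), MOD) % MOD
    return result

answer = smallest_square_multiple([4, 6, 10])
```

For `[4, 6, 10]` the LCM is 60 = 2^2 * 3 * 5, so the answer is 2^2 * 3^2 * 5^2 = 900.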

Set Theory & Geometry: Two arcs on the same circle overlap with wrapping values

As background, I'm a computer programmer and I'm working on a software library that allows a computer to quickly search through all dates to find a set of dates that satisfies a criterion. For example:
I want a list of every possible time that has ever occurred on a Friday or a Saturday that is in April or May during the first week of the month.
My library uses numerical sets to efficiently represent ranges of dates that satisfy a criterion.
I've been thinking about ways to improve the performance of some parts of the app and I think that by combining sets and some geometry, I can really improve my results. However, my geometry is a bit rusty and I was hoping you might be able to help.
Here's my thought:
Certain elements of time can be represented as a circular dial. For example, Minutes can be positioned on a clock with values between 0...59. We could store valid ranges as a list of arcs. For example, If we wanted all times that ended with 05..10, we could store [5,10]. If we wanted all times that end with :45-59 or :00-15, we could store [45, 15]. Notice how this last arc "loops around" the dial. Here's a mockup showing different ranges intersecting on a dial
My question is this:
Given a set of whole numbers between N...M arranged into a circle.
Given Arc1, which is represented by [A, B], and Arc2, which is represented by [C, D], where A, B, C, and D are all within the range N...M
How do I determine:
A. Whether the arcs intersect.
B. If they do, what their intersection is.
C. If they do, what their union is.
Thank you so much for your help. If you're not able to help, if you can point me in the right direction, that would be great.
Thanks!
A simple and safe approach is to split the intervals that straddle 0. Then you perform pairwise interval intersection/union (for instance if A < D and C < B then [max(A,C), min(B,D)] for the intersection), and merge them if they meet at 0.
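One way to sketch the splitting approach for the intersection in Python, assuming the dial holds the integers 0..n-1, arcs are inclusive `(start, end)` pairs, and an arc with start > end wraps past n-1 back to 0. The function names are made up for this example. Two arcs on a circle can overlap in two separate places, so the result is a list.

```python
def split_arc(a, b, n):
    """Split an arc that straddles 0 into ordinary, non-wrapping intervals."""
    return [(a, b)] if a <= b else [(a, n - 1), (0, b)]

def intersect_arcs(arc1, arc2, n):
    """Intersection of two inclusive arcs on a dial 0..n-1, as a list of arcs."""
    pieces = []
    for a, b in split_arc(*arc1, n):
        for c, d in split_arc(*arc2, n):
            lo, hi = max(a, c), min(b, d)
            if lo <= hi:              # the plain intervals overlap
                pieces.append((lo, hi))
    pieces.sort()
    # Re-join two pieces that meet at 0 into a single wrapping arc.
    if len(pieces) > 1 and pieces[0][0] == 0 and pieces[-1][1] == n - 1:
        first = pieces.pop(0)
        last = pieces.pop()
        pieces.append((last[0], first[1]))
    return pieces

overlap = intersect_arcs((45, 15), (50, 5), 60)
```

An empty list means the arcs do not intersect; the union can be built the same way by merging the split intervals instead of clipping them.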
It seems the primitive operation to implement would be something like 'is the number X contained in the arc [A,B]'. Once you have that, you could implement an [A,B]/[C,D] arc-intersection predicate as follows.
Arc intersection means exactly that at least one of the following conditions is met:
C is contained in [A,B]
D is contained in [A,B]
A is contained in [C,D]
B is contained in [C,D]
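The containment primitive and the four conditions can be sketched in Python with modular arithmetic. This assumes a dial of the integers 0..n-1 (for a range N...M, subtract N first and use n = M - N + 1) and inclusive arcs that run from their first endpoint upward to their second, wrapping at n. The names `contains` and `arcs_intersect` are illustrative.

```python
def contains(a, b, x, n):
    """True if x lies on the inclusive arc from a up to b on a dial of size n."""
    # Measure both x and b as offsets from a, walking in the direction of the arc.
    return (x - a) % n <= (b - a) % n

def arcs_intersect(arc1, arc2, n):
    """True if at least one endpoint of either arc lies on the other arc."""
    (a, b), (c, d) = arc1, arc2
    return (contains(a, b, c, n) or contains(a, b, d, n)
            or contains(c, d, a, n) or contains(c, d, b, n))

hit = arcs_intersect((45, 15), (5, 10), 60)
miss = arcs_intersect((5, 10), (20, 30), 60)
```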
One way to implement this contained-in-arc test without any branches is with some trigonometry and the vector cross product. Not sure it would be faster (the math/branches performance tradeoff is entirely empirical), but it might be worth a try.
Denote Xa = cos(A/N * 2PI), Ya = sin(A/N * 2PI), where N is the number of values on the dial, and similarly for Xb, Yb etc.
C is contained in [A,B] is equivalent to:
Xa * Yc - Ya * Xc > 0
AND
Xc * Yb - Yc * Xb > 0
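You can complete the other 3 conditions in an identical manner. Note that the strict inequalities exclude the endpoints, and that the test is only reliable for arcs shorter than half the circle.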
Hope this turns out useful.

Mathematical function for string similarity score

I'm working on a string similarity algorithm, and was thinking about how to give a score between 0 and 1 when comparing two strings. The two variables for this function are the Levenshtein distance D (added, removed and changed characters) and the maximum length of the two strings L (but you could also take the average).
My initial algorithm was just 1-D/L, but this gave scores that were too high for short strings, e.g. 'tree' and 'bee' would get a score of 0.5, and scores that were too low for longer strings, which have more in common even if half of the characters are different.
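For reference, the baseline score 1-D/L can be reproduced with a short Python sketch. The Levenshtein routine is the standard row-by-row dynamic programme; the function names and the choice to score two empty strings as 1.0 are assumptions made for this example.

```python
def levenshtein(s, t):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))    # distances from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]

def similarity(s, t):
    """Baseline score 1 - D/L, with L the length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - levenshtein(s, t) / longest

score = similarity("tree", "bee")
```

This gives 0.5 for 'tree' and 'bee' (D = 2, L = 4), matching the example above.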
Now I'm looking for a mathematical function that can output a better score. I wasn't able to come up with one, so I sketched this height map of a 3D plot (L is x and D = y).
Does anyone know how to convert such a graph to an equation, if I would be better off to just create a lookup table or if there is an existing solution?

Calculating Cosine Similarity of two Vectors of Different Size

I have 2 questions,
I've made a vector from a document by finding out how many times each word appeared in a document. Is this the right way of making the vector? Or do I have to do something else also?
Using the above method I've created vectors for 16 documents, which are of different sizes. Now I want to apply cosine similarity to find out how similar each document is. The problem I'm having is getting the dot product of two vectors because they are of different sizes. How would I do this?
Sounds reasonable, as long as it means you have a list/map/dict/hash of (word, count) pairs as your vector representation.
You should pretend that you have zero values for the words that do not occur in some vector, without storing these zeros anywhere. Then, you can use the following algorithm to compute the dot product of these vectors (pseudocode):
algorithm dot_product(a : WordVector, b : WordVector):
    dot = 0
    for word, x in a do
        y = lookup(word, b)
        dot += x * y
    return dot
The lookup part can be anything, but for speed, I'd use hashtables as the vector representation (e.g. Python's dict).
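In Python, the pseudocode above might be fleshed out as follows, using `collections.Counter` as the (word, count) map. Tokenising with `str.split` and returning 0.0 for an empty document are simplifications chosen for this sketch.

```python
import math
from collections import Counter

def dot_product(a, b):
    """Sparse dot product: words missing from a vector count as zero."""
    if len(a) > len(b):
        a, b = b, a                   # iterate over the smaller vector
    return sum(x * b.get(word, 0) for word, x in a.items())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    norm = math.sqrt(dot_product(a, a) * dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0

doc1 = Counter("the cat sat on the mat".split())
doc2 = Counter("the cat ate the rat".split())
sim = cosine_similarity(doc1, doc2)
```

Only the words the two documents share ('the' and 'cat') contribute to the dot product, so the different vector sizes never matter.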
