Find intersection of multiple skip lists - information-retrieval

This is the algorithm for finding the intersection of two skip lists:
(Figure: finding the intersection of two skip lists; credit: Stanford)
We can see that "jumping by skips" gains a lot of efficiency compared to moving one step at a time.
But here I'm curious: what if this is extended to multiple skip lists, say 100? So far I can only think of divide and conquer, in which the skip lists are grouped in pairs, each pair is intersected, and the results are then merged, which sounds time-consuming and inefficient.
What is a better way to compute the intersection of multiple skip lists in the least time?

Initialize a pointer to the beginning of each of your skip lists.
We will maintain two things:
The current max value pointed to
A min-heap of (value, pointer) pairs.
At each step:
Check if all pointers have the same value by comparing the top of the min-heap with the max value.
If those two are the same:
All current values must be the same (since min == max), so the value is in the intersection.
Add that value to the output.
Pop your min-heap, advance its pointer until it gets to a bigger value, and push the new value. Update max to the new value.
Else:
Pop your min-heap, advance its pointer towards the max value, skipping as needed.
If its new value exceeds the max value, update the max value.
Push the new value onto your min-heap.
Stop when any list runs out (i.e., you need to advance a pointer but can't).
This is a slight twist on a classic programming interview problem "Merge k sorted lists" -- the algorithm here is very similar. I'd suggest looking at that if anything in this answer is unclear.
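For concreteness, here is a minimal Java sketch of the heap-driven intersection above. ConcurrentSkipListSet is Java's built-in skip list, and NavigableSet.ceiling stands in for the "advance with skips, towards a target" step; all names are illustrative.

import java.util.*;
import java.util.concurrent.ConcurrentSkipListSet;

public class KWayIntersect {
    // One heap entry per list: its current value plus the set it came from.
    record Entry(int value, NavigableSet<Integer> set) {}

    static List<Integer> intersect(List<NavigableSet<Integer>> lists) {
        List<Integer> out = new ArrayList<>();
        PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparingInt(Entry::value));
        int max = Integer.MIN_VALUE;
        for (NavigableSet<Integer> s : lists) {
            if (s.isEmpty()) return out;               // any empty list: empty intersection
            heap.add(new Entry(s.first(), s));
            max = Math.max(max, s.first());
        }
        while (true) {
            Entry min = heap.poll();
            int target;
            if (min.value() == max) {                  // min == max: all lists agree
                out.add(min.value());
                target = min.value() + 1;              // advance past the matched value
            } else {
                target = max;                          // skip forward towards the current max
            }
            Integer next = min.set().ceiling(target);  // O(log n) skip-list search
            if (next == null) return out;              // a list ran out: done
            max = Math.max(max, next);
            heap.add(new Entry(next, min.set()));
        }
    }

    public static void main(String[] args) {
        NavigableSet<Integer> a = new ConcurrentSkipListSet<>(List.of(1, 3, 5, 7, 9));
        NavigableSet<Integer> b = new ConcurrentSkipListSet<>(List.of(3, 4, 5, 9, 11));
        NavigableSet<Integer> c = new ConcurrentSkipListSet<>(List.of(0, 3, 5, 8, 9));
        System.out.println(intersect(List.of(a, b, c))); // prints [3, 5, 9]
    }
}

With 100 lists, each round costs one O(log k) heap operation plus one skip-list search, and each search can leap over long runs of non-matching values instead of stepping through them.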

Related

Find the first root and local maximum/minimum of a function

Problem
I want to find
The first root
The first local minimum/maximum
of a black-box function in a given range.
The function has the following properties:
It's continuous and differentiable.
It's a combination of constant and periodic functions. All periods are known.
(Better still if it can be done under weaker assumptions.)
What is the fastest way to get the root and the extremum?
Do I need more assumptions or bounds of the function?
What I've tried
I know I can use root-finding algorithm. What I don't know is how to find the first root efficiently.
It needs to be fast: it should run within a few milliseconds at a precision of 1.0 over a range of 1.0e+8, which is the hard part.
Since the range can be quite large and the result must be precise, I can't brute-force it by checking every possible subrange.
I considered the bisection method, but it's too slow at finding the first root when the function has only one big root in the range, since every subrange has to be checked.
It's preferable if the solution is in java, but any similar language is fine.
Background
I want to calculate when an arbitrary celestial object reaches a certain height.
It's a configuration-defined virtual object, so I can't assume anything about the object.
It's not easy to get either analytical solution or simple approximation because various coordinates are involved.
I decided to find a numerical solution for this.
For a general black box function, this can't really be done. Any root finding algorithm on a black box function can't guarantee that it has found all the roots or any particular root, even if the function is continuous and differentiable.
The property of being periodic gives a bit more hope, but you can still have periodic functions with infinitely many roots in a bounded domain. Given that your function relates to celestial objects, this isn't likely to happen. Assuming your periodic functions are sinusoidal, I believe you can get away with checking subranges on the order of one-quarter of the shortest period (out of all the periodic components).
Maybe try Brent's Method on the shortest quarter period subranges?
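For illustration, a Java sketch of that idea: scan quarter-period windows left to right and run plain bisection on the first window whose endpoints change sign (Brent's method would converge faster, but bisection keeps the sketch short). It assumes the shortest period is known and can miss a root that touches zero without crossing it.

import java.util.function.DoubleUnaryOperator;

public class FirstRoot {
    // Scan [a, b] in quarter-period windows; bisect the first bracketing window.
    // Returns null if no sign change is found in the whole range.
    static Double firstRoot(DoubleUnaryOperator f, double a, double b,
                            double shortestPeriod, double tol) {
        double step = shortestPeriod / 4.0;
        double lo = a, fLo = f.applyAsDouble(lo);
        while (lo < b) {
            if (fLo == 0.0) return lo;
            double hi = Math.min(lo + step, b);
            double fHi = f.applyAsDouble(hi);
            if (fLo * fHi < 0) {                    // sign change: a root is bracketed
                while (hi - lo > tol) {             // plain bisection on the bracket
                    double mid = 0.5 * (lo + hi);
                    double fMid = f.applyAsDouble(mid);
                    if (fLo * fMid <= 0) hi = mid;
                    else { lo = mid; fLo = fMid; }
                }
                return 0.5 * (lo + hi);
            }
            lo = hi;
            fLo = fHi;
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative stand-in for the real function: period 2000*pi, range 1.0e+8.
        DoubleUnaryOperator f = x -> Math.sin(x / 1000.0) + 0.5;
        System.out.println(firstRoot(f, 0, 1.0e8, 2000 * Math.PI, 1.0)); // ~3665.2
    }
}

At roughly 1.0e8 / (500 * pi), i.e. about 64,000 window evaluations plus one bisection, this comfortably fits the few-milliseconds budget.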
Another approach would be to apply your root finding algorithm iteratively. If your range is (a, b), then apply your algorithm to that range to find a root at say c < b. Then apply your algorithm to the range (a, c) to find a root in that range. Continue until no more roots are found. The last root you found is a good candidate for your minimum root.
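That narrowing loop might look like the following in Java; findAnyRoot is a hypothetical stand-in for whatever root finder you already have (it should return some root in (a, b), or null when it finds none), and EPS is the required precision.

import java.util.function.DoubleUnaryOperator;

public class LeftmostRoot {
    static final double EPS = 1.0;  // precision from the question

    // Hypothetical black-box root finder: returns some root in (a, b), or null.
    static Double findAnyRoot(DoubleUnaryOperator f, double a, double b) {
        throw new UnsupportedOperationException("plug your root finder in here");
    }

    static Double leftmostRoot(DoubleUnaryOperator f, double a, double b) {
        Double best = null;
        Double c = findAnyRoot(f, a, b);
        while (c != null) {
            best = c;                           // best (leftmost) candidate so far
            c = findAnyRoot(f, a, c - EPS);     // search strictly to the left of it
        }
        return best;                            // null if the range had no root at all
    }
}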
A black-box function over an arbitrary range? You can't even be sure its domain is continuous over that range. What kind of solutions are you looking for: natural numbers, integers, reals, complex? These are all questions that greatly affect the answer.
So the first thing is to determine what kind of number you accept as a result.
Second is some kind of protection against limits of the function that blow up your calculations as the function heads towards plus or minus infinity.
While we're on the topic of limits: the function could edge towards zero and look like a solution without ever touching 0 and actually becoming one. This comes down to your margin of error, i.e., how close something has to be to count as good enough.
For real-number solutions (which I assume you want), I think your simplest-to-implement bet is to take an interval and apply this divide-and-conquer algorithm:
Take the lower and upper bounds and the middle value (or an approximate middle value if a bound has infinitely many decimals).
Evaluate the function at all three points, with some protection against infinities.
Remember all three values in an array together with their results (three value-result pairs).
Remember the current best value (the one closest to a solution) in a separate variable (a value-result pair).
Step forward: repeat the above on the first-to-second range and the second-to-third range.
Take the new value-result pair that is closest to a solution.
Clear the old value-result pairs and replace them with the new ones from this iteration, while keeping the overall best value-result pair.
Repeat to whatever precision you want, but watch the memory explode with each iteration; the number of stored values grows exponentially. It can be improved if you, say, take one interval, go as deep as you want, remember the best value-result pair, then delete all the other memory and dig into the next interval; that depth-first variant is roughly what the sketch below does.
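For what it's worth, here is a minimal Java sketch of that depth-first variant, using |f(x)| as the measure of closeness to a solution; per the margin-of-error caveat above, it can return a near-miss that never actually reaches zero, and its cost grows as 2^depth.

import java.util.function.DoubleUnaryOperator;

public class IntervalSearch {
    // Recursively split [lo, hi] at the midpoint, keeping whichever evaluated
    // point has the smallest |f(x)|. Memory use is O(depth), not exponential.
    static double closestToZero(DoubleUnaryOperator f, double lo, double hi, int depth) {
        double mid = 0.5 * (lo + hi);
        double best = better(f, better(f, lo, mid), hi);
        if (depth == 0) return best;
        best = better(f, best, closestToZero(f, lo, mid, depth - 1));
        best = better(f, best, closestToZero(f, mid, hi, depth - 1));
        return best;
    }

    // Of two candidate points, return the one whose function value is nearer zero.
    static double better(DoubleUnaryOperator f, double a, double b) {
        return Math.abs(f.applyAsDouble(a)) <= Math.abs(f.applyAsDouble(b)) ? a : b;
    }
}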

Determine if there exists a number in the array occurring k times

I want to create a divide-and-conquer algorithm (O(n lg n) runtime) to determine whether there exists a number in an array that occurs k times. A constraint on this problem is that only an equality/inequality comparison method is defined on the objects of the array (i.e., you can't use < or >).
So I have tried a number of approaches, including splitting the array into k pieces of roughly equal size. The approach is similar to finding the majority item in an array; however, in the majority case, when you split the array you know that one half must contain the majority item if such an item exists. Any pointers or tips to put me in the right direction?
EDIT: To clear things up a little, I am wondering whether the approach of finding the majority item by splitting the array in half and recursing can be extended to other situations, where k may be n/4 or n/5, etc.
Maybe I should have phrased the question using n/k instead.
This is impossible. As a simple example of why, consider an input of length n with all elements distinct, and k = 2. The only way to be sure no element appears twice is to compare every element against every other element, which takes Ω(n^2) equality comparisons. Until you have performed all possible comparisons, you cannot be sure that some pair you didn't compare isn't actually equal.

How to make use of Hadoop MapReduce to calculate max or min of an expression with tens of variables

Given a math expression with tens of variables, each of which can be assigned one of several candidate values. For instance, say the expression contains 20 variables, each of which can be assigned 1, 2, or 3. In this case there are 3^20 possible assignments.
If we want to get the max or min value of the expression over all possible assignments by brute force, then because of the exponential number of candidate assignments it would take far too long for a standalone computer to finish. So I thought of Hadoop MapReduce: compute the assignments in the Mapper and do the aggregation in the Combiner and Reducer.
A very direct but awkward solution is to write all the candidate assignments to a file, load it into HDFS, and let Hadoop MapReduce take it from there. But as I mentioned at the beginning, the actual input is only an expression; the assignments would still have to be generated by a program (say, in Java) anyway.
Therefore, I would like to ask: is it possible to compute the min/max value over all assignments without loading the assignments into HDFS?
Say the function you are computing is f(x) = ax^2 + bx + c.
Your input file should contain values for a, b, c, and x in some fixed order.
You can write a custom mapper to compute the function value and emit it to the reducer as (some constant key, computed value).
Then write a custom reducer that, for each key, finds the max and min values and writes them to the output file.
Hope this helps.
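To avoid materializing the assignments in HDFS at all, one common pattern is to store only partition indices in the input file (say the numbers 0 through 99, one per line) and let each mapper enumerate its share of the 3^20 assignments on the fly. A hedged Java sketch; the class names, constants, and the eval placeholder are all illustrative, and the job-driver wiring is omitted.

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class ExprMinMax {
    // Placeholder: evaluate the actual expression at one assignment.
    static double eval(int[] assignment) {
        double s = 0;
        for (int v : assignment) s += v * v;
        return s;
    }

    public static class EnumMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        static final int VARS = 20, CHOICES = 3, PARTS = 100;

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            int part = Integer.parseInt(value.toString().trim());
            long total = (long) Math.pow(CHOICES, VARS);     // 3^20 assignments
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            int[] a = new int[VARS];
            for (long i = part; i < total; i += PARTS) {     // this mapper's slice
                long rem = i;                                // decode i as a base-3 number
                for (int j = 0; j < VARS; j++) { a[j] = (int) (rem % CHOICES) + 1; rem /= CHOICES; }
                double v = eval(a);
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
            ctx.write(new Text("min"), new DoubleWritable(min));
            ctx.write(new Text("max"), new DoubleWritable(max));
        }
    }

    public static class MinMaxReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            boolean isMin = key.toString().equals("min");
            double agg = isMin ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : vals)
                agg = isMin ? Math.min(agg, v.get()) : Math.max(agg, v.get());
            ctx.write(key, new DoubleWritable(agg));
        }
    }
}

The same reducer class can double as the combiner here, since min and max are associative.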

Find combination of numbers close to X

Imagine you have a list of N numbers. You also have the "target" number. You want to find the combination of Z numbers that summed together are close to the target.
Example:
Target = 3.085
List = [0.87, 1.24, 2.17, 1.89]
Output:
[0.87, 2.17] = 3.04 (0.045 offset)
In the example above you would get the group [0.87, 2.17] because it has the smallest offset from the target, 0.045. Here the group has 2 numbers, but it could have more or fewer.
My question is: what is the fastest way/algorithm to solve this problem? I'm thinking of a recursive approach, but I'm not yet sure exactly how. What is your opinion on this problem?
This is a variant of the subset-sum (knapsack) problem. To solve it you would do the following:
def knap(numbers, target):
    # Build up every sum reachable by summing a subset of the numbers.
    sums = {0.0}
    for n in numbers:
        # Extend each sum found so far by n; pruning at 2*target is optional.
        sums |= {s + n for s in sums if s + n < 2 * target}
    # Return the reachable sum closest to the target.
    return min(sums, key=lambda s: abs(s - target))
Essentially, you are building up all of the possible sums of the numbers. If you have integral values, you can make this even faster by using an array instead of a set, as sketched below.
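For integral values, that array variant might look like the following Java sketch (assumes non-negative integers; the question's example is scaled by 1000 to make it integral):

public class ClosestSum {
    // reachable[s] is true iff some subset of the numbers sums to exactly s.
    // The table is capped at 2 * target, matching the pruning above.
    static int closestSum(int[] numbers, int target) {
        int cap = 2 * target;
        boolean[] reachable = new boolean[cap + 1];
        reachable[0] = true;
        for (int n : numbers)
            for (int s = cap; s >= n; s--)       // downward: each number used at most once
                if (reachable[s - n]) reachable[s] = true;
        int best = 0;
        for (int s = 0; s <= cap; s++)
            if (reachable[s] && Math.abs(s - target) < Math.abs(best - target)) best = s;
        return best;
    }

    public static void main(String[] args) {
        // Prints 3040 (0.87 + 2.17); note 1240 + 1890 = 3130 ties at offset 45.
        System.out.println(closestSum(new int[]{870, 1240, 2170, 1890}, 3085));
    }
}

This runs in O(n * target) time, which beats enumerating all 2^n subsets whenever the target is modest.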
Intuitively, I would start by sorting the list (use your favorite algorithm). Then find the index of the largest element that is smaller than the target. From there, pick the largest element less than the target and combine it with the smallest element; that would probably be your baseline offset. If it is a negative offset, you can keep looking at combinations using bigger numbers; if it is a positive offset, at combinations using smaller numbers. At that point recursion might be appropriate.
This doesn't yet address the need for 'Z' numbers, of course, but it's a step in the right direction and can be generalized.
Of course, depending on the size of the problem the "fastest" way might well be to divide up the possible combinations, assign them to a group of machines, and have each one do a brute-force calculation on its subset. Depends on how the question is phrased. :)

Are the x,y and row, col attributes of a two-dimensional array backwards?

If I think of the x,y coordinate plane, x,y is the common notation for an ordered pair, but if I use a two-dimensional array I have myArray[row][col], and row is the y and col is the x. Is that backwards, or am I just thinking about it wrong? I was thinking it would look like myArray[x][y], but that's wrong if I want real rows and columns (like a game board). Wouldn't it be myArray[y][x] to truly mimic a row-column board?
You have it right, and it does feel a bit backwards. The row number is a y coordinate, and the column number is an x coordinate, and yet we usually write row,col but we also usually write x,y.
Whether you want to write your array as [y][x] or [x][y] depends mostly on how much you actually care about the layout of your array in memory (and if you do, what language you use). And whether you want to write functions/methods that can operate on rows or columns in isolation.
If you are writing C/C++ code, arrays are stored in row-major order, which means that a single row of data can be treated as a one-dimensional array, but a single column of data cannot. If I remember correctly, VB uses column-major order, so languages vary. I'd be surprised if C# weren't also row-major, but I don't know.
This is what I do for my own sanity:
int x = array[0].length;
int y = array.length;
And then for every single array call I make, I write:
array[y][x]
This is particularly useful for graphing algorithms and horizontal/vertical matrix flipping.
It doesn't matter how you store your data in the array ([x][y] or [y][x]). What does matter is that you always loop over the array in a contiguous way. A Java two-dimensional array is essentially a one-dimensional array of arrays (e.g., in the [y][x] case you have an outer array indexed by y, where each entry holds the [x] array for that line of y).
To run through the whole array efficiently, access the data so that you don't keep jumping from one inner x-array to another: access one y element and visit all of its x's before moving on to the next y element.
So in an array[y][x] situation, always put the first index in the outer loop and the second in the inner loop:
for (int ys = 0; ys < array.length; ys++) {
    for (int xs = 0; xs < array[ys].length; xs++) {
        // do your stuff here
    }
}
And of course, cache both lengths in local variables outside the loops so they aren't fetched on every iteration.
I love the question. You’re absolutely right. Most of the time we are either thinking (x, y) or (row, col). It was years before I questioned it. Then one day I realized that I always processed for loops as if x was a row and y was a column, though in plane geometry it’s actually the opposite. As mentioned by many, it really doesn’t matter in most cases, but consistency is a beautiful thing.
Actually, it's up to you; there is no one right way of thinking about it. For example, I usually think of a one-dimensional array as a row of cells, so in my mind it is array[col][row]. But it is really up to you.
I bet there are a lot of differing opinions on this one. Bottom line is, it doesn't really matter as long as you are consistent. If you have other libraries or similar that is going to use the same data it might make sense to do whatever they do for easier integration.
If this is strictly in your own code, do whatever you feel comfortable with. My personal preference would be to use myArray[y][x]. If they are large, there might be performance benefits of keeping the items that you are going to access a lot at the same time together. But I wouldn't worry about that until at a very late stage if at all.
Well, not really: if you think of a row as elements along the x axis, then a 2D array is a stack of rows along the y axis. It's natural to use y to select a row (within a particular row, y is fixed, and it's x that changes across the row's indices) and then use x to operate on that row's elements (the rows are stacked vertically, each one at a particular y value).
For better or for worse, the inconsistent notation was inherited from math.
Multidimensional arrays follow matrix notation, where M(i, j) represents the matrix element in row i and column j.
Multidimensional arrays therefore are not backward if used to represent a matrix, but they will seem backward if used to represent a 2D Cartesian plane where (x, y) is the typical ordering for a coordinate.
Also note that 2D Cartesian planes typically are oriented with the y-axis growing upward. However, that also is backward from how 2D arrays/matrices are typically visualized (and with the coordinate systems for most raster images).
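A tiny Java illustration of translating between the two conventions (the grid size and values are arbitrary):

public class GridDemo {
    public static void main(String[] args) {
        int rows = 3, cols = 4;
        int[][] grid = new int[rows][cols];  // grid[row][col], i.e. grid[y][x]

        int x = 3, y = 2;                    // a Cartesian point (x, y)
        grid[y][x] = 42;                     // swap the order when indexing the array

        System.out.println(grid[2][3]);      // 42: row 2, column 3 is the same cell
    }
}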
