Getting Least Common Equal Sum and Combination of Two Sets of Numbers - math

I'm currently creating a program in C# that would look for the lowest possible equal sum of two sets of numbers, in which you may repeat the numbers as many times as you want.
For example, I have these two sets { 10, 13, 18 } and { 12, 16, 22 }. The lowest possible sum that I can get is 28: (10 + 18) and (12 + 16).
Another example is {5, 7, 9} and {1, 2, 3}. Lowest possible sum is 5: (5) and (1+1+1+1+1) or (1+2+2) or (2+3) and so on.
Any suggestions on where I can start? I'll actually be using 6 numbers per set and the numbers are in the hundreds / thousands mark.

Maintain two sets, initialized from your input sets, and ordered by increasing numbers (e.g. using a tree-based set structure). Now compare the first (i.e. minimal) elements from the two sets. The smaller one you remove from its set, add all elements from the corresponding input set to that value, and insert the results. When both sets have the same minimal element, then you are done and that element is your least common equal sum.
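A minimal Python sketch of the procedure above, assuming all inputs are positive integers (otherwise the loop may never terminate). The name least_common_sum is only illustrative, and a binary heap plus a "seen" set stands in for the tree-based ordered set.

```python
import heapq

def least_common_sum(a, b):
    """Smallest value reachable as a sum (with repetition) from both a and b."""
    if min(a) <= 0 or min(b) <= 0:
        raise ValueError("inputs must be positive integers")
    heap_a, heap_b = sorted(set(a)), sorted(set(b))  # a sorted list is a valid heap
    seen_a, seen_b = set(heap_a), set(heap_b)
    while heap_a[0] != heap_b[0]:
        # Work on whichever side currently has the smaller minimum.
        if heap_a[0] < heap_b[0]:
            heap, seen, base = heap_a, seen_a, a
        else:
            heap, seen, base = heap_b, seen_b, b
        smallest = heapq.heappop(heap)
        # Extend the removed sum by every number of the matching input set.
        for x in base:
            candidate = smallest + x
            if candidate not in seen:
                seen.add(candidate)
                heapq.heappush(heap, candidate)
    return heap_a[0]

print(least_common_sum([10, 13, 18], [12, 16, 22]))  # 28
print(least_common_sum([5, 7, 9], [1, 2, 3]))        # 5
```

The sketch only returns the sum; to recover the combination you could store, for each inserted value, the value it was derived from.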

Related

Optimizing (minimizing) the number of lines in file: an optimization problem in line with permutations and agenda scheduling

I have a calendar, typically a csv file containing a number of lines. Each line corresponds to an individual and is a sequence of consecutive values '0' and '1' where '0' refers to an empty time slot and '1' to an occupied slot. There cannot be two separated sequences in a line (e.g. two sequences of '1' separated by a '0' such as '1,1,1,0,1,1,1,1').
The problem is to minimize the number of lines by combining the individuals and resolving the collisions between time slots. Note the time slots cannot overlap. For example, for 4 individuals, we have the following sequences:
id1:1,1,1,0,0,0,0,0,0,0
id2:0,0,0,0,0,0,1,1,1,1
id3:0,0,0,0,1,0,0,0,0,0
id4:1,1,1,1,0,0,0,0,0,0
One can arrange them to end up with two lines, while keeping track of permuted individuals (for the record). In our example it yields:
1,1,1,0,1,0,1,1,1,1 (id1 + id2 + id3)
1,1,1,1,0,0,0,0,0,0 (id4)
The constraints are the following:
The number of individuals ranges from 500 to 1000.
The length of the sequence will never exceed 30.
Each sequence in the file has the exact same length.
The algorithm needs to be reasonable in execution time because this task may be repeated up to 200 times.
We don't necessarily search for the optimal solution; a near-optimal solution would suffice.
We need to keep track of the combined individuals (as in the example above).
Genetic algorithms seem a good option, but how do they scale (in terms of execution time) with the size of this problem?
A suggestion in Matlab or R would be (greatly) appreciated.
Here is a sample test:
id1:1,1,1,0,0,0,0,0,0,0
id2:0,0,0,0,0,0,1,1,1,1
id3:0,0,0,0,1,0,0,0,0,0
id4:1,1,1,1,1,0,0,0,0,0
id5:0,1,1,1,0,0,0,0,0,0
id6:0,0,0,0,0,0,0,1,1,1
id7:0,0,0,0,1,1,1,0,0,0
id8:1,1,1,1,0,0,0,0,0,0
id9:1,1,0,0,0,0,0,0,0,0
id10:0,0,0,0,0,0,1,1,0,0
id11:0,0,0,0,1,0,0,0,0,0
id12:0,1,1,1,0,0,0,0,0,0
id13:0,0,0,1,1,1,0,0,0,0
id14:0,0,0,0,0,0,0,0,0,1
id15:0,0,0,0,1,1,1,1,1,1
id16:1,1,1,1,1,1,1,1,0,0
Solution(s)
@Nuclearman provided a working solution in O(NT) (where N is the number of individuals (ids) and T is the number of time slots (columns)) based on a greedy algorithm.
I study algorithms as a hobby and I agree with Caduchon on this one, that greedy is the way to go. Though I believe this is actually the clique cover problem, to be more accurate: https://en.wikipedia.org/wiki/Clique_cover
Some ideas on how to approach building cliques can be found here: https://en.wikipedia.org/wiki/Clique_problem
Clique problems are related to independence set problems.
Considering the constraints, and that I'm not familiar with matlab or R, I'd suggest this:
Step 1. Build the independence set time slot data. For each time slot that is a 1, create a mapping (for fast lookup) of all records that also have a one. None of these can be merged into the same row (they all need to be merged into different rows). IE: For the first column (slot), the subset of the data looks like this:
id1 :1,1,1,0,0,0,0,0,0,0
id4 :1,1,1,1,1,0,0,0,0,0
id8 :1,1,1,1,0,0,0,0,0,0
id9 :1,1,0,0,0,0,0,0,0,0
id16:1,1,1,1,1,1,1,1,0,0
The data would be stored as something like 0: Set(id1,id4,id8,id9,id16) (zero-indexed columns: we start at column 0 instead of column 1, though it probably doesn't matter here). The idea here is to have O(1) lookup, since you may need to quickly tell that id2 is not in that set. You can also use nested hash tables for that, i.e. 0: { id1: true, id4: true }. Sets also allow for set operations, which may help quite a bit when determining unassigned columns/ids.
In any case, none of these 5 can be grouped together. That means at best (given that column) you end up with 5 rows (if the other rows can be merged into those 5 without conflict); you cannot have fewer.
Performance of this step is O(NT), where N is the number of individuals and T is the number of time slots.
Step 2. Now we have options for how to attack things. To start, we pick the time slot with the most individuals and use that as our starting point. That gives us the minimum possible number of rows. In this case, we actually have a tie, where the 2nd and 5th columns both have 7. I'm going with the 2nd, which looks like:
id1 :1,1,1,0,0,0,0,0,0,0
id4 :1,1,1,1,1,0,0,0,0,0
id5 :0,1,1,1,0,0,0,0,0,0
id8 :1,1,1,1,0,0,0,0,0,0
id9 :1,1,0,0,0,0,0,0,0,0
id12:0,1,1,1,0,0,0,0,0,0
id16:1,1,1,1,1,1,1,1,0,0
Step 3. Now that we have our starting groups, we need to add to them while trying to avoid conflicts between new members and old group members (which would require an additional row). This is where we get into NP-complete territory, as there are a ton of ways (roughly 2^N, to be more accurate) to assign things.
I think the best approach might be a random approach as you can theoretically run it as many times as you have time for to get results. So here is the randomized algorithm:
1. Given the starting column and ids (1,4,5,8,9,12,16 above), mark this column and ids as assigned.
2. Randomly pick an unassigned column (time slot). If you want a perhaps "better" result, pick the one with the least (or most) number of unassigned ids. For a faster implementation, just loop over the columns.
3. Randomly pick an unassigned id. For a better result, perhaps pick the one with the most/least groups that could be assigned that id. For a faster implementation, just pick the first unassigned id.
4. Find all groups that the unassigned id could be assigned to without creating a conflict.
5. Randomly assign it to one of them. For a faster implementation, just pick the first one that doesn't cause a conflict. If there are no groups without conflict, create a new group and assign the id to that group as the first id. The optimal result is that no new groups have to be created.
6. Update the data for that row (make 0s into 1s as needed).
7. Repeat steps 3-6 until no unassigned ids for that column remain.
8. Repeat steps 2-7 until no unassigned columns remain.
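The "faster implementation" variant of these steps (loop over the columns, take the first unassigned id, use the first group without conflict) might look roughly like the following Python sketch. The function name greedy_groups and the dict-of-lists input format are assumptions made for illustration; ties for the starting column are broken by taking the leftmost one.

```python
def greedy_groups(rows):
    """rows: dict id -> list of 0/1. Returns a list of (merged_row, member_ids)."""
    n_cols = len(next(iter(rows.values())))
    # Step 1 data: for every column, the ids that have a 1 there.
    by_col = [[i for i, r in rows.items() if r[c]] for c in range(n_cols)]
    # Starting point: the fullest column (leftmost one on ties).
    start = max(range(n_cols), key=lambda c: len(by_col[c]))
    groups = [(list(rows[i]), [i]) for i in by_col[start]]
    assigned = set(by_col[start])
    for c in range(n_cols):
        for i in by_col[c]:
            if i in assigned:
                continue
            # First group whose merged row shares no occupied slot with this id.
            for merged, members in groups:
                if not any(a and b for a, b in zip(merged, rows[i])):
                    break
            else:
                # Every group conflicts, so open a new one.
                merged, members = [0] * n_cols, []
                groups.append((merged, members))
            for k, bit in enumerate(rows[i]):
                merged[k] |= bit  # turn 0s into 1s as needed
            members.append(i)
            assigned.add(i)
    return groups

rows = {
    "id1": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "id2": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "id3": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    "id4": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
}
for merged, members in greedy_groups(rows):
    print(merged, members)
```

For the randomized variant, the column order, the id order and the choice among conflict-free groups could each be shuffled with the random module, keeping the best result over several runs.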
Example with the faster implementation approach, which gives an optimal result (there cannot be fewer than 7 rows and there are only 7 rows in the result).
First 3 columns: No unassigned ids (all have 0). Skipped.
4th Column: Assigned id13 to id9 group (13=>9). id9 Looks like this now, showing that the group that started with id9 now also includes id13:
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
5th Column: 3=>1, 7=>5, 11=>8, 15=>12
Now it looks like:
id1 :1,1,1,0,1,0,0,0,0,0 (+id3)
id4 :1,1,1,1,1,0,0,0,0,0
id5 :0,1,1,1,1,1,1,0,0,0 (+id7)
id8 :1,1,1,1,1,0,0,0,0,0 (+id11)
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
id12:0,1,1,1,1,1,1,1,1,1 (+id15)
id16:1,1,1,1,1,1,1,1,0,0
We'll just quickly look at the next columns and see the final result:
7th Column: 2=>1, 10=>4
8th column: 6=>5
Last column: 14=>4
So the final result is:
id1 :1,1,1,0,1,0,1,1,1,1 (+id3,id2)
id4 :1,1,1,1,1,0,1,1,0,1 (+id10,id14)
id5 :0,1,1,1,1,1,1,1,1,1 (+id7,id6)
id8 :1,1,1,1,1,0,0,0,0,0 (+id11)
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
id12:0,1,1,1,1,1,1,1,1,1 (+id15)
id16:1,1,1,1,1,1,1,1,0,0
Conveniently, even this "simple" approach allowed us to assign the remaining ids to the original 7 groups without conflict. This is unlikely to happen in practice with, as you say, "500-1000" ids and up to 30 columns, but it is far from impossible. Roughly speaking, 500 / 30 is about 17, and 1000 / 30 is about 33. So I would expect you to be able to get down to roughly 10-60 rows, with about 15-45 being likely, but it's highly dependent on the data and a bit of luck.
In theory, the performance of this method is O(NT), where N is the number of individuals (ids) and T is the number of time slots (columns). It takes O(NT) to build the data structure (basically converting the table into a graph). After that, each column requires checking and assigning at most O(N) individual ids, which might be checked multiple times. In practice, since O(T) is roughly O(sqrt(N)) and performance increases as you go through the algorithm (similar to quick sort), it is likely O(N log N) or O(N sqrt(N)) on average, though really it's probably more accurate to use O(E), where E is the number of 1s (edges) in the table. Each edge likely gets checked and iterated over a fixed number of times, so that is probably a better indicator.
The NP-hard part comes into play in working out which ids to assign to which groups such that no new groups (rows) are created, or the lowest possible number of new groups is created. I would run the "fast implementation" and the "random" approaches a few times and see how many extra rows (beyond the known minimum) you get, and whether that is a small amount.
This problem, contrary to some comments, is not NP-complete due to the restriction that "There cannot be two separated sequences in a line". This restriction implies that each line can be considered to be representing a single interval. In this case, the problem reduces to a minimum coloring of an interval graph, which is known to be optimally solved via a greedy approach. Namely, sort the intervals in descending order according to their ending times, then process the intervals one at a time in that order always assigning each interval to the first color (i.e.: consolidated line) that it doesn't conflict with or assigning it to a new color if it conflicts with all previously assigned colors.
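A short Python sketch of this interval-coloring greedy, under the assumption that every line contains exactly one run of 1s (as the question states) and at least one 1. The function name color_intervals and the dict-of-lists input are illustrative only. Because intervals are processed by descending end time, a new interval conflicts with a group exactly when the group's earliest start is not after the new interval's end, so only that one value is tracked per group.

```python
def color_intervals(rows):
    """rows: dict id -> list of 0/1 with a single run of 1s. Returns lists of merged ids."""
    intervals = []
    for name, slots in rows.items():
        start = slots.index(1)                       # first occupied slot
        end = len(slots) - 1 - slots[::-1].index(1)  # last occupied slot
        intervals.append((name, start, end))
    # Process intervals in descending order of ending time.
    intervals.sort(key=lambda t: t[2], reverse=True)
    groups = []  # each entry: [earliest start in the group, [member ids]]
    for name, start, end in intervals:
        for group in groups:
            if group[0] > end:       # ends before everything already in the group
                group[0] = start     # start <= end < old earliest start
                group[1].append(name)
                break
        else:
            groups.append([start, [name]])  # conflicts with every color so far
    return [members for _, members in groups]

rows = {
    "id1": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "id2": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "id3": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    "id4": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
}
print(color_intervals(rows))  # [['id2', 'id3', 'id4'], ['id1']]
```

The grouping differs from the one shown in the question (id4 rather than id1 joins id2 and id3), but it also uses two lines.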
Consider a constraint programming approach. Here is a question very similar to yours: Constraint Programming: Scheduling with multiple workers.
A very simple MiniZinc-model could also look like (sorry no Matlab or R):
include "globals.mzn";
% Small example from the question:
%int: jobs = 4;
int: jobs = 16;
set of int: JOB = 1..jobs;
%array[JOB] of int: start = [0, 6, 4, 0];
%array[JOB] of int: duration = [3, 4, 1, 4];
% Start slot (0-based) and length of each line's run of 1s, in id order.
array[JOB] of int: start = [0, 6, 4, 0, 1, 7, 4, 0, 0, 6, 4, 1, 3, 9, 4, 0];
array[JOB] of int: duration = [3, 4, 1, 5, 3, 3, 3, 4, 2, 2, 1, 3, 3, 1, 6, 8];
% Number of combined lines; it can never exceed the number of jobs.
var 1..jobs: machines;
constraint cumulative(start, duration, [1 | j in JOB], machines);
solve minimize machines;
This model does not, however, tell which jobs are scheduled on which machines.
Edit:
Another option would be to transform the problem into a graph coloring problem. Let each line be a vertex in a graph. Create edges for all overlapping lines (the 1-segments overlap). Find the chromatic number of the graph. The vertices of each color then represent a combined line in the original problem.
Graph coloring is a well-studied problem; for larger instances, consider a local search approach using tabu search or simulated annealing.

COUNTIF where criterion is a specific sequence of cells

I'm doing some work with arithmetic sequences modulo P, in which the sequences become periodic under the modulo. My worksheet generates a sequence mod P with the first term being 0, the second term being a number K (referencing another cell), and the following terms following the recurrence relation. The period of the sequence (number of values before it repeats itself) is related to the ratio P/K, so, for example, if P=2 and K=1, I get the sequence {0,1,1,0,1,1,0,1,1,...}, which has a period of 3, so when P/K=2, the period is 3.
I currently have a formula which uses the COUNTIF function to count the number of zeroes in the range, which is then divided out of the total range, currently an arbitrary size of 120, and this gives me the correct period for many ratios of P/K. Most of the time, however, the sequence generated exhibits semi-periodicity and sometimes even quasi-periodicity, such as in the case of K=1 and modulo 9: {0,1,1,2,3,5,8,4,3,7,1,8,0,8,8,7,6,4,1,5,6,2,8,1,...}, where P/K=9, the period is 24, and the semi-period is 12 (because of the 0,8,8,... part of the sequence). In such cases, my current COUNTIF formula thinks the full period is 12, even though it should be 24, because it counts the zeroes which define the semi-period.
What I would like to do is adjust the formula so that instead of the criterion for counting being 0, it would only count triplet sequences of cells in the pattern 0,K,K.
My current formula:
=QUOTIENT(120,(COUNTIF(B2:DQ2,0)))
So if I have =QUOTIENT(120,(COUNTIF(B2:DQ2,*X*))) I want the "X", which is currently 0, to reference a specific sequence of cells, namely the first three of the overall series, so something like: =QUOTIENT(120,(COUNTIF(B2:DQ2,(0,C2,D2)))) although obviously that criterion is not in remotely the correct syntax.
I'm not well-versed in writing macros, so that would probably be out of the question.
I would do this with four helper rows plus the final formula. Someone more clever than I am might be able to do it in one cell with an array formula; but compared to array formulas I think the helper rows are easier to understand and, if desired, tweak.
Once this is set up, if you're always going to use three as your criterion, you can hide the helper rows (to hide a row, right-click on the gray number label on the left side of the spreadsheet, and choose "hide").
So your sequence is in row 2, starting in column B. We'll set up the first helper row in row 3, starting in column C. In cell C3 put the formula =C2=$B$2. This will evaluate to FALSE, which is equivalent to 0. Copy and paste that formula all the way to cell DQ3 (or however many columns you want to run it). Cells below a sequence number equal to the first number in the sequence will evaluate to TRUE, which is equivalent to 1.
The next two helper rows are very similar. In cell D4 put the formula =D2=$C$2 and copy and paste to cell DQ4. This row tests which cells are equal to the second number in the sequence.
In cell E5 put the formula =E2=$D$2 and copy and paste to cell DQ5, showing which cells are equal to the third number in the sequence.
The last helper row is a little different, so I left an empty row after the first three helpers. In cell E7 I put the formula =SUM(C3,D4,E5); copy and paste that over to column DQ. This counts how many matches were found in the previous three helper rows. If all three match, the result of this formula will be 3 and your criterion for determining the period will have been fulfilled.
Now to show the period: in the cell you want to have this number, put the formula =MATCH(3,E7:DQ7,0). This searches the last (fourth) helper row looking for a cell that is equal to 3. (Obviously you could modify this method to match only the first two sequence numbers, or to match more than 3, and then you'd adjust the first parameter in the MATCH formula.) The last parameter in this MATCH formula is 0 because the helper row is not sorted. The return value is the index of the first match: a match in E7 would be index 1, a match in F7 would be index 2, etc. That index is the period.
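I tested this in LibreOffice 4.4.4.3. In Excel, SUM skips TRUE/FALSE values in referenced cells, so use =C3+D4+E5 there instead of =SUM(C3,D4,E5).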
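For checking the worksheet's output, the same triplet-matching idea can be sketched in Python. This assumes the recurrence is "each term is the sum of the previous two, mod P", which is inferred from the example sequences in the question; the function names are hypothetical.

```python
def sequence_mod(p, k, length=120):
    """First `length` terms of the sequence 0, K, ... mod P (assumed Fibonacci-style)."""
    seq = [0, k % p]
    while len(seq) < length:
        seq.append((seq[-1] + seq[-2]) % p)
    return seq

def period_by_triplet(seq):
    """Smallest shift at which the first three terms reappear, or None if not found."""
    head = seq[:3]
    for shift in range(1, len(seq) - 2):
        if seq[shift:shift + 3] == head:
            return shift
    return None

print(period_by_triplet(sequence_mod(2, 1)))  # 3
print(period_by_triplet(sequence_mod(9, 1)))  # 24
```

The mod 9 case returns 24 rather than 12 because the triplet at position 12 is 0,8,8, which does not match 0,1,1.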

Generating two sets of numbers where the sum of each set and the sum of their dot product is N

In this question Getting N random numbers that the sum is M, the object was to generate a set of random numbers that sums to a specific number N. After reading this question, I started playing around with the idea of generating sets of numbers that satisfy this condition
sum(A) == sum(B) && sum(B) == sum(A * B)
An example of this would be
A <- c(5, 5, -10, 6, 6, -12)
B <- c(5, -5, 0, 6, -6, 0)
In this case, the three sums equal zero. Obviously, those sets aren't random, but they satisfy the condition. Is there a way to generate 'random' sets of data that satisfy the above condition? (As opposed to using a little algorithm as in the above example.)
(Note: I tagged this as an R question, but the language really doesn't matter to me.)
You'd need to define the first vector in N-dimensional space, and the 2nd one will have N-2 degrees of freedom (i.e. random numbers) since the sum and one angle are already determined.
The 2nd vector would need to be transformed into N-dimensional space; there are infinitely many transforms that could work, so if you don't care about the probability distribution of the resulting vectors, just choose the one that's most intuitive to you.
There's a nice geometrical interpretation to the first constraint: it constrains the 2nd vector to a (hyper-)plane in N-dimensional space; the 2nd constraint doesn't have a simple geometric interpretation.
Check out hyperspherical coordinates.
You can generate one set completely randomly. And generate randomly all numbers in set B except for two numbers. Since you have two equations you should be able to solve for those two numbers.
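A minimal sketch of that idea in Python, using exact fractions so the three sums match exactly. The function name random_sets and the value range are arbitrary choices for illustration. The last two entries of A must differ, otherwise the 2x2 linear system for the last two entries of B has no unique solution, so the sketch redraws the last entry in that case.

```python
import random
from fractions import Fraction

def random_sets(n, seed=None, lo=-10, hi=10):
    """Return (a, b) with sum(a) == sum(b) == sum(a[i] * b[i]); requires n >= 2."""
    rng = random.Random(seed)
    a = [Fraction(rng.randint(lo, hi)) for _ in range(n)]
    while a[-1] == a[-2]:            # the system below needs a[-2] != a[-1]
        a[-1] = Fraction(rng.randint(lo, hi))
    b = [Fraction(rng.randint(lo, hi)) for _ in range(n - 2)]
    target = sum(a)
    # Unknowns x = b[n-2], y = b[n-1] must satisfy:
    #   x + y = r1            (sum(b) == sum(a))
    #   p*x + q*y = r2        (sum(a*b) == sum(a))
    r1 = target - sum(b)
    r2 = target - sum(u * v for u, v in zip(a, b))
    p, q = a[-2], a[-1]
    y = (r2 - p * r1) / (q - p)
    x = r1 - y
    return a, b + [x, y]

a, b = random_sets(6, seed=1)
print(sum(a), sum(b), sum(u * v for u, v in zip(a, b)))
```

The last two entries of B are generally not integers and are not drawn from the same distribution as the others, so the result is only "random" in a loose sense.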

Math: Five numbers with unique sums

So I need a way to figure out how to get 5 numbers, and when you add any 2 of them, it will result in a sum that you can only get by adding those specific two numbers.
Here's an example of what I'm talking about, but with 3 numbers:
1
3
5
1 + 3 = 4
1 + 5 = 6
3 + 5 = 8
Adding any two of those numbers will end up with a unique sum that cannot be found by adding any other pair of the numbers. I need to do this, but with 5 different numbers. And if you have a method of figuring out how to do this with any amount of numbers, sharing that would be appreciated as well.
Thank you
1, 10, 100, 1000, 10000 gives you five numbers like you desire.
In general, 1, 10, 100, 1000, ..., 10^(k-1), where k is the number of numbers that you need.
Even more generally, you can use b^0, b^1, ..., b^(k-1), where b >= 2. Note that you have the special property that not only are all the pairwise sums unique, but all the subset sums are unique (just look at representations in base b).
The set {1, 2, 5, 11, 21} also works.
You can start with a set of two or three elements that fit that property (any addition operation on two elements from the set {1,2,5} gives you a unique sum) and only include the next number being considered if additions of current elements and this new element also give you unique sums.
An example run-through:
Suppose our starting set S is S={1,2,5}. Let U be the set of all sums between two elements in S.
Elements in S give us unique sums 1+2=3, 1+5=6, 2+5=7, so U={3,6,7}.
Consider adding 11 to this set. We need to check that 1+11, 2+11, and 5+11 all give us sums that are not seen in U and are all unique among themselves.
1+11=12, 2+11=13, 5+11=16.
Since 12, 13, and 16 are all unique sums among themselves, and are not found in U, we can update S and U to be:
S1 = {1,2,5,11}
U1 = {3,6,7,12,13,16}.
You can do the same procedure for 21, and you should get:
S2 = {1,2,5,11,21}
U2 = {3,6,7,12,13,16,22,23,26,32}.
If all you need is a quick set though, the solution that Jason posted is a lot faster to produce.
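The check described in the run-through can be written as a small Python helper. The name try_add is hypothetical; S is kept as a list and U as a set of pairwise sums, and both are updated only when the candidate is accepted.

```python
def try_add(s, u, x):
    """Add x to s if every new pairwise sum is unused; update u and report success."""
    if x in s:
        return False
    new_sums = {x + e for e in s}
    # Reject if two new sums coincide or if any new sum was already seen.
    if len(new_sums) < len(s) or new_sums & u:
        return False
    s.append(x)
    u |= new_sums
    return True

s, u = [1, 2, 5], {3, 6, 7}
print(try_add(s, u, 6))   # False, because 1 + 6 = 7 is already in U
print(try_add(s, u, 11))  # True
print(try_add(s, u, 21))  # True
print(s)                  # [1, 2, 5, 11, 21]
print(sorted(u))          # [3, 6, 7, 12, 13, 16, 22, 23, 26, 32]
```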
1, 2, 4, 8, 16
1, 3, 9, 27, 81
suggests x ^ n, where n is a member of a subset of the natural numbers

Maths Question: number of different permutations

This is more of a maths question than programming but I figure a lot of people here are pretty good at maths! :)
My question is: Given a 9 x 9 grid (81 cells) that must contain the numbers 1 to 9 each exactly 9 times, how many different grids can be produced. The order of the numbers doesn't matter, for example the first row could contain nine 1's etc. This is related to Sudoku and we know the number of valid Sudoku grids is 6.67×10^21, so since my problem isn't constrained like Sudoku by having to have each of the 9 numbers in each row, column and box then the answer should be greater than 6.67×10^21.
My first thought was that the answer is 81!; however, on further reflection, this assumes that the 81 numbers possible for each cell are different, distinct numbers. They are not: there are 81 possible numbers for each cell but only 9 possible different numbers.
My next thought was then that each of the cells in the first row can be any number between 1 and 9. If by chance the first row happened to be all the same number, say all 1s, then each cell in the second row could only have 8 possibilities, 2-9. If this continued down until the last row, then the number of different permutations could be calculated by 9^2 * 8^2 * 7^2 ..... * 1^2. However, this doesn't work if each row doesn't contain 9 of the same number.
It's been quite a while since I studied this stuff and I can't think of a way to work it out, I'd appreciate any help anyone can offer.
Imagine taking 81 blank slips of paper and writing a number from 1 to 9 on each slip (nine of each number). Shuffle the deck, and start placing the slips on the 9x9 grid.
You'd be able to create 81! different patterns if you considered each slip to be unique.
But instead you want to consider all the 1's to be equivalent.
For any particular configuration, how many times will that configuration be repeated due to the 1's all being equivalent? The answer is 9!, the number of ways you can permute the nine slips with 1 written on them.
So that cuts the total number of permutations down to 81!/9!. (You divide by the number of indistinguishable permutations. Instead of 9! indistinguishable permutations, imagine there were just 2 indistinguishable permutations. You would divide the count by 2, right? So the rule is, you divide by the number of indistinguishable permutations.)
Ah, but you also want the 2's to be equivalent, and the 3's, and so forth.
By the same reasoning, that cuts down the number of permutations to
81!/(9!)^9
By Stirling's approximation, that is roughly 5.8 * 10^70 (the exact value is about 5.3 * 10^70).
First, let's start with 81 numbers, 1 through 81. The number of permutations for that is 81P81, or 81!. Simple enough.
However, we have nine 1s, which can be arranged in 9! indistinguishable permutations. Same with 2, 3, etc.
So what we have is the total number of board permutations divided by all the indistinguishable permutations of all numbers, or 81! / (9! ** 9).
>>> import operator
>>> from functools import reduce
>>> reduce(operator.mul, range(1, 82)) // reduce(operator.mul, range(1, 10)) ** 9
53130688706387569792052442448845648519471103327391407016237760000000000
