I have a calendar, typically a csv file containing a number of lines. Each line corresponds to an individual and is a sequence of consecutive values '0' and '1' where '0' refers to an empty time slot and '1' to an occupied slot. There cannot be two separated sequences in a line (e.g. two sequences of '1' separated by a '0' such as '1,1,1,0,1,1,1,1').
The problem is to minimize the number of lines by combining the individuals and resolving the collisions between time slots. Note the time slots cannot overlap. For example, for 4 individuals, we have the following sequences:
id1:1,1,1,0,0,0,0,0,0,0
id2:0,0,0,0,0,0,1,1,1,1
id3:0,0,0,0,1,0,0,0,0,0
id4:1,1,1,1,0,0,0,0,0,0
One can arrange them to end up with two lines, while keeping track of permuted individuals (for the record). In our example it yields:
1,1,1,0,1,0,1,1,1,1 (id1 + id2 + id3)
1,1,1,1,0,0,0,0,0,0 (id4)
The constraints are the following:
The number of individuals range from 500 to 1000,
The length of the sequence will never exceed 30
Each sequence in the file has the exact same length,
The algorithm needs to be reasonable in execution time because this task may be repeated up to 200 times.
We don't necessarly search for the optimal solution, a near optimal solution would suffice.
We need to keep track of the combined individuals (as in the example above)
Genetic algorithms seems a good option but how does it scales (in terms of execution time) with the size of this problem?
A suggestion in Matlab or R would be (greatly) appreciated.
Here is a sample test:
id1:1,1,1,0,0,0,0,0,0,0
id2:0,0,0,0,0,0,1,1,1,1
id3:0,0,0,0,1,0,0,0,0,0
id4:1,1,1,1,1,0,0,0,0,0
id5:0,1,1,1,0,0,0,0,0,0
id6:0,0,0,0,0,0,0,1,1,1
id7:0,0,0,0,1,1,1,0,0,0
id8:1,1,1,1,0,0,0,0,0,0
id9:1,1,0,0,0,0,0,0,0,0
id10:0,0,0,0,0,0,1,1,0,0
id11:0,0,0,0,1,0,0,0,0,0
id12:0,1,1,1,0,0,0,0,0,0
id13:0,0,0,1,1,1,0,0,0,0
id14:0,0,0,0,0,0,0,0,0,1
id15:0,0,0,0,1,1,1,1,1,1
id16:1,1,1,1,1,1,1,1,0,0
Solution(s)
#Nuclearman provided a working solution in O(NT) (where N is the number of individuals (ids) and T is the number of time slots (columns)) based on the Greedy algorithm.
I study algorithms as a hobby and I agree with Caduchon on this one, that greedy is the way to go. Though I believe this is actually the clique cover problem, to be more accurate: https://en.wikipedia.org/wiki/Clique_cover
Some ideas on how to approach building cliques can be found here: https://en.wikipedia.org/wiki/Clique_problem
Clique problems are related to independence set problems.
Considering the constraints, and that I'm not familiar with matlab or R, I'd suggest this:
Step 1. Build the independence set time slot data. For each time slot that is a 1, create a mapping (for fast lookup) of all records that also have a one. None of these can be merged into the same row (they all need to be merged into different rows). IE: For the first column (slot), the subset of the data looks like this:
id1 :1,1,1,0,0,0,0,0,0,0
id4 :1,1,1,1,1,0,0,0,0,0
id8 :1,1,1,1,0,0,0,0,0,0
id9 :1,1,0,0,0,0,0,0,0,0
id16:1,1,1,1,1,1,1,1,0,0
The data would be stored as something like 0: Set(id1,id4,id8,id9,id16) (zero indexed rows, we start at row 0 instead of row 1 though probably doesn't matter here). Idea here is to have O(1) lookup. You may need to quickly tell that id2 is not in that set. You can also use nested hash tables for that. IE: 0: { id1: true, id2: true }`. Sets also allow for usage of set operations which may help quite a bit when determining unassigned columns/ids.
In any case, none of these 5 can be grouped together. That means at best (given that row) you must have at least 5 rows (if the other rows can be merged into those 5 without conflict).
Performance of this step is O(NT), where N is the number of individuals and T is the number of time slots.
Step 2. Now we have options of how to attack things. To start, we pick the time slot with the most individuals and use that as our starting point. That gives us the min possible number of rows. In this case, we actually have a tie, where the 2nd and 5th rows both have 7. I'm going with the 2nd, which looks like:
id1 :1,1,1,0,0,0,0,0,0,0
id4 :1,1,1,1,1,0,0,0,0,0
id5 :0,1,1,1,0,0,0,0,0,0
id8 :1,1,1,1,0,0,0,0,0,0
id9 :1,1,0,0,0,0,0,0,0,0
id12:0,1,1,1,0,0,0,0,0,0
id16:1,1,1,1,1,1,1,1,0,0
Step 3. Now that we have our starting groups we need to add to them while trying to avoid conflicts between new members and old group members (which would require an additional row). This is where we get into NP-complete territory as there are a ton (roughly 2^N to be more accurately) to assign things.
I think the best approach might be a random approach as you can theoretically run it as many times as you have time for to get results. So here is the randomized algorithm:
Given the starting column and ids (1,4,5,8,9,12,16 above). Mark this column and ids as assigned.
Randomly pick an unassigned column (time slot). If you want a perhaps "better" result. Pick the one with the least (or most) number of unassigned ids. For faster implementation, just loop over the columns.
Randomly pick an unassigned id. For a better result, perhaps the one with the most/least groups that could be assigned that ID. For faster implementation, just pick the first unassigned id.
Find all groups that unassigned ID could be assigned to without creating conflict.
Randomly assign it to one of them. For faster implementation, just pick the first one that doesn't cause a conflict. If there are no groups without conflict, create a new group and assign the id to that group as the first id. The optimal result is that no new groups have to be created.
Update the data for that row (make 0s into 1s as needed).
Repeat steps 3-5 until no unassigned ids for that column remain.
Repeat steps 2-6 until no unassigned columns remain.
Example with the faster implementation approach, which is an optimal result (there cannot be less than 7 rows and there are only 7 rows in the result).
First 3 columns: No unassigned ids (all have 0). Skipped.
4th Column: Assigned id13 to id9 group (13=>9). id9 Looks like this now, showing that the group that started with id9 now also includes id13:
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
5th Column: 3=>1, 7=>5, 11=>8, 15=>12
Now it looks like:
id1 :1,1,1,0,1,0,0,0,0,0 (+id3)
id4 :1,1,1,1,1,0,0,0,0,0
id5 :0,1,1,1,1,1,1,0,0,0 (+id7)
id8 :1,1,1,1,1,0,0,0,0,0 (+id11)
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
id12:0,1,1,1,1,1,1,1,1,1 (+id15)
id16:1,1,1,1,1,1,1,1,0,0
We'll just quickly look the next columns and see the final result:
7th Column: 2=>1, 10=>4
8th column: 6=>5
Last column: 14=>4
So the final result is:
id1 :1,1,1,0,1,0,1,1,1,1 (+id3,id2)
id4 :1,1,1,1,1,0,1,1,0,1 (+id10,id14)
id5 :0,1,1,1,1,1,1,1,1,1 (+id7,id6)
id8 :1,1,1,1,1,0,0,0,0,0 (+id11)
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
id12:0,1,1,1,1,1,1,1,1,1 (+id15)
id16:1,1,1,1,1,1,1,1,0,0
Conveniently, even this "simple" approach allowed for us to assign the remaining ids to the original 7 groups without conflict. This is unlikely to happen in practice with as you say "500-1000" ids and up to 30 columns, but far from impossible. Roughly speaking 500 / 30 is roughly 17, and 1000 / 30 is roughly 34. So I would expect you to be able to get down to roughly 10-60 rows with about 15-45 being likely, but it's highly dependent on the data and a bit of luck.
In theory, the performance of this method is O(NT) where N is the number of individuals (ids) and T is the number of time slots (columns). It takes O(NT) to build the data structure (basically converting the table into a graph). After that for each column it requires checking and assigning at most O(N) individual ids, they might be checked multiple times. In practice since O(T) is roughly O(sqrt(N)) and performance increases as you go through the algorithm (similar to quick sort), it is likely O(N log N) or O(N sqrt(N)) on average, though really it's probably more accurate to use O(E) where E is the number of 1s (edges) in the table. Each each likely gets checked and iterated over a fixed number of times. So that is probably a better indicator.
The NP hard part comes into play in working out which ids to assign to which groups such that no new groups (rows) are created or a lowest possible number of new groups are created. I would run the "fast implementation" and the "random" approaches a few times and see how many extra rows (beyond the known minimum) you have, if it's a small amount.
This problem, contrary to some comments, is not NP-complete due to the restriction that "There cannot be two separated sequences in a line". This restriction implies that each line can be considered to be representing a single interval. In this case, the problem reduces to a minimum coloring of an interval graph, which is known to be optimally solved via a greedy approach. Namely, sort the intervals in descending order according to their ending times, then process the intervals one at a time in that order always assigning each interval to the first color (i.e.: consolidated line) that it doesn't conflict with or assigning it to a new color if it conflicts with all previously assigned colors.
Consider a constraint programming approach. Here is a question very similar to yours: Constraint Programming: Scheduling with multiple workers.
A very simple MiniZinc-model could also look like (sorry no Matlab or R):
include "globals.mzn";
%int: jobs = 4;
int: jobs = 16;
set of int: JOB = 1..jobs;
%array[JOB] of var int: start = [0, 6, 4, 0];
%array[JOB] of var int: duration = [3, 4, 1, 4];
array[JOB] of var int: start = [0, 6, 4, 0, 1, 8, 4, 0, 0, 6, 4, 1, 3, 9, 4, 1];
array[JOB] of var int: duration = [3, 4, 1, 5, 3, 2, 3, 4, 2, 2, 1, 3, 3, 1, 6, 8];
var int: machines;
constraint cumulative(start, duration, [1 | j in JOB], machines);
solve minimize machines;
This model does not, however, tell which jobs are scheduled on which machines.
Edit:
Another option would be to transform the problem into a graph coloring problem. Let each line be a vertex in a graph. Create edges for all overlapping lines (the 1-segments overlap). Find the chromatic number of the graph. The vertices of each color then represent a combined line in the original problem.
Graph coloring is a well-studied problem, for larger instances consider a local search approach, using tabu search or simulated annealing.
I'm doing some work with arithmetic sequences modulo P, in which the sequences become periodic under the modulo. My worksheet generates a sequence mod P with the first term being 0, the second term being a number K (referencing another cell), and the following terms following the recurrence relation. The period of the sequence (number of values before it repeats itself) is related to the ratio P/K, s, for example, if P=2 and K=1, I get the sequence {0,1,1,0,1,1,0,1,1,...}, which has a period of 3, so when P/K=2, the period is 3.
I currently have a formula which uses the COUNTIF function to count the number of zeroes in the range, which is then divided out of the total range, currently an arbitrary size of 120, and this gives me the correct period for many ratios of P/K. Most of the time, however, the sequence generated exhibits semi-periodicity and sometimes even quasi-periodicity, such as in the case of K=1 and modulo 9: {0,1,1,2,3,5,8,4,3,7,1,8,0,8,8,7,6,4,1,5,6,2,8,1,...}, where P/K=9, the period is 24, and the semi-period is 12 (because of the 0,8,8,... part of the sequence). In such cases, my current COUNTIF formula thinks the full period is 12, even though it should be 24, because it counts the zeroes which define the semi-period.
What I would like to do is adjust the formula so that instead of the criterion for counting being 0, it would only count triplet sequences of cells in the pattern 0,K,K.
My current formula:
=QUOTIENT(120,(COUNTIF(B2:DQ2,0)))
So if I have =QUOTIENT(120,(COUNTIF(B2:DQ2,*X*))) I want the "X", which is currently 0, to reference a specific sequence of cells, namely the first three of the overall series, so something like: =QUOTIENT(120,(COUNTIF(B2:DQ2,(0,C2,D2)))) although obviously that criterion is not in remotely the correct syntax.
I'm not well-versed in writing macros, so that would probably be out of the question.
I would do this with four helper rows plus the final formula. Someone more clever than I am might be able to do it in one cell with an array formula; but compared to array formulas I think the helper rows are easier to understand and, if desired, tweak.
Once this is set up, if you're always going to use three as your criterion, you can hide the helper rows (to hide a row, right-click on the gray number label on the left side of the spreadsheet, and choose "hide").
So your sequence is in row 2, starting in column B. We'll set up the first helper row in row 3, starting in column C. In cell C3 put the formula =C2=$B$2. This will evaluate to FALSE, which is equivalent to 0. Copy and paste that formula all the way to cell DQ3 (or however many columns you want to run it). Cells below a sequence number equal to the first number in the sequence will evaluate to TRUE, which is equivalent to 1.
The next two helper rows are very similar. In cell D4 put the formula =D2=$C$2 and copy and paste to cell DQ4. This row tests which cells are equal to the second number in the sequence.
In cell E5 put the formula =E2=$D$2 and copy and paste to cell DQ5, showing which cells are equal to the third number in the sequence.
The last helper row is a little different, so I left an empty row after the first three helpers. In cell E7 I put the formula =SUM(C3,D4,E5); copy and paste that over to column DQ. This counts how many matches were found in the previous three helper rows. If all three match, the result of this formula will be 3 and your criterion for determining the period will have been fulfilled.
Now to show the period: in the cell you want to have this number, put the formula =MATCH(3,E7:DQ7,0). This searches the last (fourth) helper row looking for a cell that is equal to 3. (Obviously you could modify this method to match only the first two sequence numbers, or to match more than 3, and then you'd adjust the first parameter in the MATCH formula.) The last parameter in this MATCH formula is 0 because the helper row is not sorted. The return value is the index of the first match: a match in E7 would be index 1, a match in E8 would be index 2, etc.
I tested this in LibreOffice 4.4.4.3.