So I have a DAG representing a project; each node is a task, with a variable giving how long that task takes to finish.
We can assume that it is possible to work on any number of tasks at the same time.
How do I find an optimal schedule of tasks, i.e. a sequence of tasks that results in the earliest possible completion of the project?
If you can work on any number of tasks in parallel, then an optimal schedule can easily be computed. The starting time for a task in an optimal schedule can recursively be defined as the maximum of the optimal end times (i.e. optimal start time plus duration) of all its predecessor nodes in the graph. Tasks without predecessors all start at time 0 (or at whatever time you want them to start).
This recursion can be computed iteratively by iterating over the tasks in a topological order. In pseudocode, the algorithm could look like this:
// Computes the optimal start time for each task in the DAG.
// Input:  dag      : DAG with tasks as nodes
//         duration : array with the duration of each task
// Output: array with the optimal start time of each task
def computeSchedule(dag, duration):
    starttime := [0 for each node in dag]
    for t in nodesInTopologicalOrder(dag):
        starttime[t] := max(starttime[p] + duration[p] for p in predecessors(t)),
                        or 0 if t has no predecessors
    return starttime
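A runnable sketch of this in Python, using the standard library's graphlib for the topological order (the DAG is represented as a successor map; the names and the example graph are illustrative):

```python
from graphlib import TopologicalSorter

def compute_schedule(successors, duration):
    """Optimal (earliest) start time for every task in a DAG.

    successors maps each task to the tasks depending on it;
    duration maps each task to its processing time.
    """
    # Invert the successor map into a predecessor map.
    predecessors = {t: [] for t in successors}
    for t, succs in successors.items():
        for s in succs:
            predecessors.setdefault(s, []).append(t)

    # graphlib's TopologicalSorter expects a node -> predecessors mapping.
    start = {}
    for t in TopologicalSorter(predecessors).static_order():
        start[t] = max((start[p] + duration[p]
                        for p in predecessors.get(t, [])), default=0)
    return start

# Illustrative DAG: A precedes B and C, both of which precede D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
dur = {"A": 2, "B": 3, "C": 1, "D": 4}
print(compute_schedule(succ, dur))  # A starts at 0, B and C at 2, D at 5
```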
I have a directed graph.
The structure is as follows:
There are about 20,000 nodes in the graph.
I make the simplest request: MATCH (b1)-[:NEXT_BAR*10]->(b2) RETURN b1.id, b2.id LIMIT 5
The request is processed quickly.
But if I increase the number of relationships, the query takes much longer to process. In other words, the speed depends on the number of relationships.
This request takes longer than 5 minutes to complete: MATCH (b1)-[:NEXT_BAR*10000]->(b2) RETURN b1.id, b2.id LIMIT 5
This is still a simplified version. The request can have more than two nodes and the number of relationships can still be a range.
How can I optimize a query with a large number of relationships?
Perhaps there are other graph DBMSs that don't have this problem?
Variable-length relationship queries have exponential time and memory complexity.
If R is the average number of matching relationships per node and D is the depth of the search, then the complexity is O(R^D). This complexity will exist in any DBMS.
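To see the O(R^D) growth concretely, here is a small Python sketch; the brute-force enumeration mirrors the work a fixed-length pattern forces on the query engine, and the binary-tree graph is a made-up example:

```python
def count_fixed_length_paths(adj, start, depth):
    """Count all paths of exactly `depth` edges from `start` by
    brute-force DFS -- the same enumeration a fixed-length pattern
    like -[:NEXT_BAR*10]-> forces on the query engine."""
    if depth == 0:
        return 1
    return sum(count_fixed_length_paths(adj, nxt, depth - 1)
               for nxt in adj.get(start, []))

# Made-up graph where every non-leaf node has R = 2 successors
# (a complete binary tree with D + 1 levels).
R, D = 2, 10
size = 2 ** (D + 1) - 1
adj = {n: [k for k in (2 * n + 1, 2 * n + 2) if k < size]
       for n in range(size)}

print(count_fixed_length_paths(adj, 0, D))  # R ** D = 1024
```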
The theory is simple here, but there are a couple of intricacies in the query execution.
-[:NEXT_BAR*10000]- matches paths that are exactly 10000 edges long, so the query engine spends time enumerating those paths. Another thing to mention is that in (b1)-[...]->(b2), b1 and b2 are not bound to specific nodes, which means the query engine has to scan all nodes as potential starting points. With a LIMIT, the scan can stop once the limited number of results has been produced. The overall execution also depends on the efficiency of the variable-length path implementation.
Some of the following might help:
Is it feasible to start from a specific node?
If there are branches, the only hope is aggressive filtering because of exponential complexity (as cybersam well explained).
Use a smaller number in the variable expansion, or a range, e.g., [:NEXT_BAR*..10000]. In this case, the query engine will match any path of up to 10000 edges (different semantics, but maybe applicable).
By default, * expansion is executed DFS-style; BFS might be the right approach here. Memgraph (DISCLAIMER: I'm the co-founder and CTO) also supports BFS-style expansion with a filtering lambda.
Here is a Python script I've used to generate and import data into Memgraph. By using a small nodes_no you can quickly observe the execution patterns.
import mgclient

# Make a connection to the database.
connection = mgclient.connect(
    host='127.0.0.1',
    port=7687,
    sslmode=mgclient.MG_SSLMODE_REQUIRE)
connection.autocommit = True
cursor = connection.cursor()

# Clean and set up the database instance.
cursor.execute("""MATCH (n) DETACH DELETE n;""")
cursor.execute("""CREATE INDEX ON :Node(id);""")

# Import dataset.
nodes_no = 10

# Create nodes.
for identifier in range(0, nodes_no):
    cursor.execute("""CREATE (:Node {id: "%s"});""" % identifier)

# Create edges.
for identifier in range(1, nodes_no):
    cursor.execute("""
        MATCH (start_node:Node {id: "%s"})
        MATCH (end_node:Node {id: "%s"})
        CREATE (start_node)-[:NEXT_BAR]->(end_node);
    """ % (identifier - 1, identifier))
I am looking to calculate the wait in a queue per position, or a general wait time based on your queue position. It is a FIFO queue.
List of current performance status of the service
Size AvTime Queue Processing AvgFileSize(mb)
1 (0 - 1 mb) 2.57 18 3 0.21
2 (1 - 5 mb) 12.43 2 4 2.16
3 (5 - 10 mb) 23.38 9 8 6.72
4 (10 - 25 mb) 38.17 1 4 12.52
5 (>= 25 mb) 109.31 0 0 32.41
The current list of processing and queued batch files. It only lists the current user's files, which is why some queue numbers are missing.
Queue Filename Status
30 Batch (3456).XML(2) Queue
20 Batch (2399).xml(3) Queue
14 batch (1495).xml(1) Queue
12 batch (1497).xml(1) Queue
15 batch (1499).xml(1) Queue
10 batch (1500).xml(4) Queue
13 batch (1496).xml(1) Queue
11 batch (1501).xml(1) Queue
9 batch (1498).xml(1) Queue
8 batch (1494).xml(1) Queue
7 batch (1493).xml(1) Queue
6 batch (1492).xml(1) Queue
5 batch (1491).xml(1) Queue
4 batch (1490).xml(1) Queue
3 batch (1).xml(1) Queue
2 Batch1.xml(1) Queue
1 Batch1.XML(2) Queue
Batch1.xml(1) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(2) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
So I am looking to add more information to the list: how long a batch file at, say, position 20 will wait in the queue before it starts processing.
Queue Filename Status
30 (*30min) Batch (3456).XML(2) Queue
20 (*10min) Batch (2399).xml(3) Queue
...
*estimated
Your question doesn't quite provide enough context to make it possible to answer, but I can make some guesses based on the sample displays you provided.
Looks like you have a "single queue, multiple server" setup. In other words, you have a single FIFO queue and some fixed number N of jobs that can be in processing at any given time. Is that right?
For your algorithm, let's assume you have the following information:
Position of our job in queue (position N means there are N jobs ahead of us)
Size of our job
Size of each job ahead of us in the queue
Pool of jobs being processed, with a certain maximum size N
Size of each job currently being processed
Elapsed time for each job currently in process (how long since that job started)
First of all, you will need a function ExpectedJobDuration(jobsize) that computes an expected job processing time for a job of a given size, based on the statistics shown in your "performance status" table. This looks pretty straightforward. Given a job size, first figure out which of your five size categories it falls into (0: 0-1mb, 1: 1-5mb, etc.) Then take your job size and multiply by the average time divided by the average size of jobs in that category. That will give you an estimate of ExpectedJobDuration(jobsize), which will tell you how long it takes to run a job of a given size, under the assumption that job time is proportional to job size, for jobs within a particular size range.
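As a sketch of ExpectedJobDuration, using the numbers from the performance-status table above (times assumed to be in a single consistent unit, and with the proportionality assumption just described):

```python
# Size categories from the performance-status table:
# (upper size bound in MB, average time, average file size in MB).
CATEGORIES = [
    (1.0, 2.57, 0.21),
    (5.0, 12.43, 2.16),
    (10.0, 23.38, 6.72),
    (25.0, 38.17, 12.52),
    (float("inf"), 109.31, 32.41),
]

def expected_job_duration(jobsize_mb):
    """Estimated processing time for a job, assuming time is
    proportional to size within each category."""
    for upper, avg_time, avg_size in CATEGORIES:
        if jobsize_mb < upper:
            return jobsize_mb * (avg_time / avg_size)
```

For example, a job exactly as big as the average file in category 1 (0.21 MB) gets the category's average time, 2.57.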
Now, for a job of a given size that's already been in process for a given time ElapsedProcessingTime, how long do we expect it to take to complete? A simple answer would be something like:
ExpectedRemainingTime = ExpectedJobDuration(jobsize) - ElapsedProcessingTime.
For jobs sitting in the queue this will be exactly the expected job duration; for jobs already being processed we subtract the time the job has already been in work. However, if there is some random variation in job processing times, this is not exactly right and could turn out to be negative. This is sort of like the actuarial problem: the average lifespan of a person is X years; how long do we expect someone to live if they are already Y years old? You would need a lot more statistical data to compute this, so for practical purposes, if the answer comes out negative, just set it to zero. (If someone is 100 years old and the average human lifespan is 90, expect them to die at any moment. That's not quite right, but perhaps OK as a first approximation. Unless you are the 100-year-old person, and not yet ready to die. :-))
OK, now we have a way to compute how long each job ahead of us in the queue should take, and how long it should take to complete jobs already in process.
If the number of jobs currently being processed is less than N (the max that can be processed at any given time) then our job can start right away. So in that case we have the answer - expected delay until our job can start is zero seconds.
Now let's look at the case where we are in position 0 in the queue. That means there are no jobs ahead of us in the queue, so our expected time to start is the minimum of the ExpectedRemainingTime of the jobs in the processing pool.
Now that gives us the basis for a recursive function that computes delay until our expected start time.
DelayUntilStart(jobPool, currentJob, queue):
    find minJob in jobPool with minimum ExpectedRemainingTime
    if currentJob is in position zero of the queue:
        return ExpectedRemainingTime(minJob)
    else:
        t := ExpectedRemainingTime(minJob)
        remove minJob from jobPool
        subtract t from the ExpectedRemainingTime of every job left in jobPool
        pop the top job from the queue and put it in the jobPool (with its full expected duration)
        return t + DelayUntilStart(jobPool, currentJob, queue)
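A runnable Python version of this idea: instead of mutating remaining times, we keep a min-heap of expected completion times, which naturally accounts for pool jobs continuing to run while earlier queued jobs wait (names are illustrative):

```python
import heapq

def delay_until_start(pool_remaining, queue_durations, max_pool):
    """Expected delay before our own job starts.

    pool_remaining  : ExpectedRemainingTime of each job in process
    queue_durations : ExpectedJobDuration of each job ahead of us, FIFO order
    max_pool        : number of jobs processed at once (N)
    """
    # Min-heap of expected completion times (relative to now).
    completions = list(pool_remaining)
    heapq.heapify(completions)
    now = 0.0
    for d in queue_durations:
        # The next queued job starts as soon as a slot frees up.
        if len(completions) >= max_pool:
            now = heapq.heappop(completions)
        heapq.heappush(completions, now + d)
    # Finally, our own job needs a free slot as well.
    if len(completions) >= max_pool:
        now = heapq.heappop(completions)
    return now
```

For example, with two slots, jobs of remaining time 5 and 6 in process, and one long job of duration 100 ahead of us, the expected delay is 6: the long job takes the slot freed at time 5, and we take the slot freed at time 6.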
Note - we may have a very long job ahead of us in the queue - but that doesn't mean we have to wait for it to complete. We just have to wait for it to get into the pool of jobs currently being processed, and then a shorter job might complete and let us into the pool.
The algorithm I just described is going to be an approximation. But it's probably about as good as you are going to get without a lot of statistics about job processing times. For practical purposes I bet it would work pretty well.
I am trying to solve an extension of the assignment problem where both the tasks and the man-hours are divisible.
For instance, man X has 4 hours available in a day and can do 1/3 of task A in 2 hours or 1/4 of task B in 4 hours. Man Y has 10 hours available and can do 1/5 of task A in 1.3 hours or 1/8 of task B in 6 hours. Is there an extension of bipartite matching which can solve this?
Thanks!
I don't think you can easily model this as a bipartite matching. However, it should be fairly easy to formulate your problem as a linear program. Just introduce, for every worker, a set of variables x_{i,j} indicating how many of person i's hours are allocated to task j.
Let h_i be the number of hours available for person i. Then, for every person i it must hold that

    sum_j x_{i,j} <= h_i

Let a_{i,j} be the "efficiency" of person i at task j, i.e., how much of the task the person can do in one hour. Then, for every task j it must hold that

    sum_i a_{i,j} * x_{i,j} >= 1

together with x_{i,j} >= 0. That's it. No integrality constraints or anything.
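As a sketch for the instance in the question, here is how the constraint matrices could be assembled; with SciPy available they could then be handed to scipy.optimize.linprog for a feasibility check (the variable ordering and names are my own):

```python
# Variables flattened as [x_XA, x_XB, x_YA, x_YB]: hours person i
# spends on task j, for the two-worker, two-task instance above.
hours = {"X": 4.0, "Y": 10.0}
# Efficiency a_{i,j}: fraction of task j done per hour by person i.
eff = {
    ("X", "A"): (1 / 3) / 2,    # 1/3 of A in 2 hours
    ("X", "B"): (1 / 4) / 4,    # 1/4 of B in 4 hours
    ("Y", "A"): (1 / 5) / 1.3,  # 1/5 of A in 1.3 hours
    ("Y", "B"): (1 / 8) / 6,    # 1/8 of B in 6 hours
}

# Inequality constraints A_ub @ x <= b_ub.
# Rows 0-1: hour budgets      sum_j x_{i,j} <= h_i.
# Rows 2-3: task completion   sum_i a_{i,j} x_{i,j} >= 1,
#           rewritten as     -sum_i a_{i,j} x_{i,j} <= -1.
A_ub = [
    [1, 1, 0, 0],                                 # X's hours
    [0, 0, 1, 1],                                 # Y's hours
    [-eff[("X", "A")], 0, -eff[("Y", "A")], 0],   # finish task A
    [0, -eff[("X", "B")], 0, -eff[("Y", "B")]],   # finish task B
]
b_ub = [hours["X"], hours["Y"], -1, -1]

# Any objective gives a feasibility check; e.g. minimize total hours:
c = [1, 1, 1, 1]
# With SciPy installed, the check would be:
#   from scipy.optimize import linprog
#   res = linprog(c, A_ub=A_ub, b_ub=b_ub)  # bounds default to x >= 0
#   feasible = (res.status == 0)            # status 2 means infeasible
```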
I'm using JMH and I find something hard to understand: I have one method annotated with @Benchmark, and I set measurementIterations(3). The method is called 3 times, but within each iteration, the function runs a rather large and seemingly random number of times.
My question is: is that number completely random? Is there a way to control it and determine how many times should the function run within an iteration? And what is the importance with set up the measurementIterations if each way or another, the function will run a random number of times?
measurementIterations defines how many measured iterations of the benchmark you want. I don't know which parameters you have specified, but by default JMH runs the benchmark time-based (the default is, I believe, 1 second): the benchmark method is invoked as often as possible within that time frame. There are ways to specify how often the method should be called in one iteration (-> batching).
I would recommend studying the JMH samples provided by JMH: http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/
They are a very good introduction to JMH and cover pitfalls that are easy to fall into when writing benchmarks.
The number of invocations per iteration depends on the JMH mode. I think you must be using AverageTime mode, which performs many invocations per iteration.
Mode.Throughput: Calculate number of operations in a time unit.
Mode.AverageTime: Calculate an average running time.
Mode.SampleTime: Calculate how long it takes for a method to run (including percentiles).
Mode.SingleShotTime: Just runs a method once (useful for cold-start testing).
For example, with Mode.SingleShotTime, each iteration invokes the method exactly once, so the method runs exactly the number of times you specify in the run (see below).
// Example runner class
public static void main(String[] args) throws RunnerException {
    Options opt = new OptionsBuilder()
            .include(JMHSample_01_HelloWorld.class.getSimpleName())
            .warmupIterations(1)      // number of warmup iterations
            .measurementIterations(1) // number of measured iterations
            .forks(1)
            .shouldDoGC(true)
            .build();
    new Runner(opt).run();
}
JMH is doing warm-up iterations that are not measured but necessary for valid results.
measurementIterations defines how many iterations should be measured. This does not include warm-up, because warm-up is not measured.
Yes, within each iteration the number of method invocations varies (it is the maximum number of times the method can run within the iteration's time budget). That count is not important; what matters is the average time per invocation.
Besides, you can control how many iterations to run with measurementIterations() and the duration of each iteration with measurementTime().
For example, if you want to run your method with only 1 iteration of 1 ms duration, without warmup, set warmupIterations to 0, measurementTime to 1 ms, and measurementIterations to 1, like below:
Options opt = new OptionsBuilder()
        .include(xxx.class.getSimpleName())
        .warmupIterations(0)
        .measurementTime(TimeValue.milliseconds(1))
        .measurementIterations(1)
        .forks(1)
        .build();
Significance of multiple iterations: the more you run, the more reliable the results.
This is another question from my past midterm; I am supposed to give a formal formulation, describe the algorithm used, and justify its correctness. Here is the problem:
The university is trying to schedule n different classes. Each class has a start and finish time. All classes have to be taught on Friday. There are only two classrooms available.
Help the university decide whether it is possible to schedule these classes without any time conflict (i.e., two classes with overlapping class times scheduled in the same classroom).
Sort the classes by starting time (O(n log n)), then go through them in order (O(n)), noting starting and ending times and looking for a point where more than two classes are going on at the same time.
This isn't a problem with a bipartite graph solution. Who told you it was?
@Beta is nearly correct. Create a list of pairs <START, time> and <END, time>; each class contributes two pairs, one START and one END.
Now sort the list by time (or, if you like, put the pairs in a min-heap, which amounts to heapsort). For equal times, put END pairs before START pairs. Then execute the following loop:
set N = 0
while sorted list not empty:
    pop <tag, time> from the head of the list
    if tag == START:
        N = N + 1
        if N > 2: return "can't schedule"
    else:  // tag == END
        N = N - 1
return "can schedule"
You can easily enrich the algorithm a bit to return the time periods where more than 2 classes are in session at the same time, return those classes, and other useful information.
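A Python version of the sweep (the two-room limit is a parameter, and the tie-breaking puts END events before START events so back-to-back classes can share a room):

```python
def can_schedule(classes, rooms=2):
    """Check whether the (start, finish) classes fit into `rooms`
    rooms without overlap, using the event sweep described above."""
    events = []
    for start, finish in classes:
        events.append((start, 1))    # class starts (kind 1)
        events.append((finish, 0))   # class ends  (kind 0)
    # Sorting (time, kind) tuples puts ENDs before STARTs at equal
    # times, so a class may begin exactly when another finishes.
    events.sort()
    in_session = 0
    for _time, kind in events:
        if kind == 1:
            in_session += 1
            if in_session > rooms:
                return False
        else:
            in_session -= 1
    return True
```

For example, three classes all running 9-12 need three rooms, so the check fails with two rooms, while 9-11, 10-12, 11-13 fits because the third class starts exactly as the first ends.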
This indeed IS a bipartite/bicoloring problem.
Treat each class as a node of a graph and create an edge between two nodes if their times overlap. If the resulting graph can be 2-colored, it is possible to schedule all the classes; otherwise it is not.
In a valid 2-coloring, each "black" node is assigned to room 1 and each "white" node to room 2.