Verify that all edges in a 2D graph are sufficiently far from each other

I have a graph where each node has coordinates in 2D (it's actually a geographic graph, with latitude and longitude.)
I need to verify that if the distance between two edges is less than MAX_DIST then they share a node. Of course, if they intersect, then the distance between them is zero.
The brute-force algorithm is trivial; is there a more efficient algorithm?
I was thinking of trying to adapt https://en.wikipedia.org/wiki/Closest_pair_of_points_problem to graph edges (and ignoring pairs of edges with a shared node), but it is not trivial to do so.

I was curious to see how the rtree index idea would perform, so I created a small script to test it using two really cool libraries for Python: Rtree and shapely.
The snippet generates 1000 segments with 1 < length < 5 and coordinates in the [0, 100] interval, populates the index and then counts the pairs that are closer than MAX_DIST==0.1 (using the classic and the index-based method).
In my tests the index method was around 25x faster using the conditions above; this might vary greatly for your data set but the result is encouraging:
found 532 pairs of close segments using classic method
7.47 seconds for classic count
found 532 pairs of close segments using index method
0.28 seconds for index count
The performance and correctness of the index method depends on how your segments are distributed (how many are close, if you have very long segments, the parameters used).
import time
import random
from rtree import Rtree
from shapely.geometry import LineString

def generate_segments(number):
    segments = {}
    for i in range(number):
        while True:
            x1 = random.randint(0, 100)
            y1 = random.randint(0, 100)
            x2 = random.randint(0, 100)
            y2 = random.randint(0, 100)
            segment = LineString([(x1, y1), (x2, y2)])
            if 1 < segment.length < 5:  # only add relatively small segments
                segments[i] = segment
                break
    return segments

def populate_index(segments):
    idx = Rtree()
    for index, segment in segments.items():
        idx.add(index, segment.bounds)
    return idx

def count_close_segments(segments, max_distance):
    count = 0
    for i in range(len(segments) - 1):
        s1 = segments[i]
        for j in range(i + 1, len(segments)):
            s2 = segments[j]
            if s1.distance(s2) < max_distance:
                count += 1
    return count

def count_close_segments_index(segments, idx, max_distance):
    count = 0
    for index, segment in segments.items():
        close_indexes = idx.nearest(segment.bounds, 10)
        for close_index in close_indexes:
            if index >= close_index:  # do not count duplicates
                continue
            close_segment = segments[close_index]
            if segment.distance(close_segment) < max_distance:
                count += 1
    return count

if __name__ == "__main__":
    MAX_DIST = 0.1
    s = generate_segments(1000)
    r_idx = populate_index(s)
    t = time.time()
    print("found %d pairs of close segments using classic method" % count_close_segments(s, MAX_DIST))
    print("%.2f seconds for classic count" % (time.time() - t))
    t = time.time()
    print("found %d pairs of close segments using index method" % count_close_segments_index(s, r_idx, MAX_DIST))
    print("%.2f seconds for index count" % (time.time() - t))

Related

Could not find the optimal solution after adding constraints

My code is as follows:
gekko = GEKKO(remote=True)
# create variables; each variable is a vector, each element
# of the vector is a binary
s = []
for i in range(N):
    s.append(gekko.Array(gekko.Var, s_len[i], value=0, lb=0, ub=1, integer=True))

# some constants used in the objective/constraint functions
c, d, r, m, L = create_c_d_r_m_L()  # they are all numpy ndarrays

# define the objective function
def objective():
    obj = 0
    for i in range(N):
        obj += np.dot(s[i], c[i]) + np.dot(s[i], d[i])
    for idx, (i, j) in enumerate(E):
        obj += np.dot(np.dot(s[i], r[idx].reshape(s_len[i], s_len[j])),
                      s[j])  # s[i] * r[i, j] * s[j]
    return obj

# add constraints
# (a) each vector can only have, and must have, one 1
for i in range(N):
    gekko.Equation(gekko.sum(s[i]) == 1)

# (b)
for t in range(N):
    peak_mem = gekko.sum([np.dot(s[i], m[i]) for i in L[t]])
    gekko.Equation(peak_mem < DEVICE_MEM)
    # DEVICE_MEM is a predefined big int

# solve
gekko.Obj(objective())
gekko.solve(disp=True)
I found that when removing constraint (b), the solver can output the optimal solution for s. However, if we add (b) and set DEVICE_MEM to a very large number (which should not affect the solution), s is no longer optimal. I'm wondering if I am doing something wrong here, because I tried both APOPT (solvertype=1) and IPOPT (solvertype=3) and they give the same non-optimal results.
To give more context to the problem: this is an optimization over the graph. N represents the number of nodes in the graph. E is the set that contains all edges in the graph. c, d, m are three types of cost of a node. r is the cost of edges. Each node has multiple strategies (represented by the vector s[i]), and we need to select the best strategy for each node so that the overall cost is minimal.
Detailed constants:
# s_len: records the length of each vector
# (the number of strategies for each node;
# here we assume the lengths are all 10)
s_len = np.ones(N) * 10

# c, d, m are the costs of each node
# let's assume the c/d/m cost for node i is just i
c, d, m = [], [], []
for i in range(N):
    c[i] = s_len[i] * [i]
    d[i] = s_len[i] * [i]
    m[i] = s_len[i] * [i]

# r is the edge cost; let's assume the cost for
# each edge is just i * j
r = []
for (i, j) in E:  # E records all edges
    cur_r = s_len[i] * s_len[j] * [i*j]
    r.append(cur_r)

# L contains the node ids; we just randomly generate 10 integers here
L = []
for i in range(N):
    cur_L = [randrange(N) for _ in range(10)]
    L.append(cur_L)
I've been stuck on this for a while and any comments/answers are highly appreciated! Thanks!
Try reframing the inequality constraint:
for t in range(N):
    peak_mem = gekko.sum([np.dot(s[i], m[i]) for i in L[t]])
    gekko.Equation(peak_mem < DEVICE_MEM)
as a variable with an upper bound:
peak_mem = gekko.Array(gekko.Var, N, ub=DEVICE_MEM)
for t in range(N):
    gekko.Equation(peak_mem[t] == \
                   gekko.sum([np.dot(s[i], m[i]) for i in L[t]]))
The N inequality constraints peak_mem < DEVICE_MEM are converted to equality constraints with slack variables, slack[t] = DEVICE_MEM - peak_mem[t], plus a simple inequality constraint slack[t] >= 0 on each slack. If the inequality constraint is far from the bound, the slack variable can become very large. Formulating the expression as a variable with an upper bound may help.
I tried using the information in the question to pose a minimal problem that could reproduce the error and the potential solution. If you need more specific suggestions, please modify the code to be a complete and minimal example that reproduces the error. This helps with verifying the solution.
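In the meantime, here is a self-contained toy model (a sketch with made-up costs, not the original problem) showing the bounded-variable pattern with gekko:

from gekko import GEKKO
import numpy as np

g = GEKKO(remote=False)
g.options.SOLVER = 1  # APOPT, needed for integer variables

cost = np.array([3.0, 1.0, 2.0])  # hypothetical per-option costs
mem = np.array([5.0, 9.0, 4.0])   # hypothetical per-option memory use
MEM_CAP = 8.0

# one binary per option, exactly one option selected
x = g.Array(g.Var, 3, value=0, lb=0, ub=1, integer=True)
g.Equation(g.sum(list(x)) == 1)

# bounded intermediate variable instead of a bare inequality constraint
peak = g.Var(ub=MEM_CAP)
g.Equation(peak == g.sum([x[i] * mem[i] for i in range(3)]))

g.Minimize(g.sum([x[i] * cost[i] for i in range(3)]))
g.solve(disp=False)
print([xi.value[0] for xi in x])  # the cheapest option that fits in memory

The option with cost 1 is excluded by the memory bound, so the expected selection is the option with cost 2.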

How to find n average number of trials before criteria are met with differing probabilities per outcome?

I've spent a few days trying to figure this out and looking up tutorials, but everything I've found so far seems close to what I need yet doesn't give the results I need.
I have a device that produces a single letter, A-F. For simplicity's sake, you can think of it like a die with letters. It will always produce one and only one letter each time it is used. However it has one major difference: each letter can have a differing known probability of being picked:
A: 25%
B: 5%
C: 20%
D: 15%
E: 20%
F: 15%
These probabilities remain constant throughout all attempts.
Additionally, I have a specific combination I must accrue before I am "successful":
As needed: 1
Bs needed: 3
Cs needed: 0
Ds needed: 1
Es needed: 2
Fs needed: 3
I need to find the average number of letter picks (i.e. rolls/trials/attempts) that have to happen for this combination of letters to be accrued. It's completely fine for any individual outcome to have more than the required number of letters, but success is only counted once each letter has been chosen at least its minimum number of times.
I've looked at plenty of tutorials for multinomial probability distribution and similar things, but I haven't found anything that explains how to find average number of trials for a scenario like this. Please kindly explain answers clearly as I'm not a wiz with statistics.
This is in addition to Severin's answer, which looks logically sound to me but might be costly to evaluate (an infinite sum involving factorials).
Let me provide some intuition that should give a good approximation.
Consider each category one at a time. Refer to this math.stackexchange question/answer. The expected number of tosses in which you get k successes for category i can be calculated as k(i)/P(i):
Given,
p(A): 25% ; Expected number of tosses to get 1 A = 1/ 0.25 = 4
p(B): 5% ; Expected number of tosses to get 3 B's = 3/ 0.05 = 60
p(C): 20% ; Expected number of tosses to get 0 C = 0/ 0.20 = 0
p(D): 15% ; Expected number of tosses to get 1 D = 1/ 0.15 = 6.67 ~ 7
p(E): 20% ; Expected number of tosses to get 2 E's = 2/ 0.20 = 10
p(F): 15% ; Expected number of tosses to get 3 F's = 3/ 0.15 = 20
This gives you an idea that getting 3 B's is your bottleneck; you can expect on average about 60 tosses for your scenario to play out.
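A quick check of that arithmetic in code (just the numbers from the question):

needed = {'A': 1, 'B': 3, 'C': 0, 'D': 1, 'E': 2, 'F': 3}
probs = {'A': 0.25, 'B': 0.05, 'C': 0.20, 'D': 0.15, 'E': 0.20, 'F': 0.15}

expected = {letter: needed[letter] / probs[letter] for letter in needed}
print(expected)                # per-letter expected number of picks
print(max(expected.values()))  # bottleneck estimate: 60.0 (the three B's)

Note that this bottleneck figure is only an approximation (a lower bound on the true expectation); the exact value computed in the Markov chain answer below comes out around 61.3.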
Well, the minimum number of throws is 10. The average would be the infinite sum
A=10•P(done in 10)+11•P(done in 11)+12•P(done in 12) + ...
For P(done in 10) we could use multinomial
P(10)=Pm(1,3,0,1,2,3|probs), where probs=[.25, .05, .20, .15, .20, .15]
For P(11) you have one more throw which you could distribute like this
P(11)=Pm(2,3,0,1,2,3|probs)+Pm(1,4,0,1,2,3|probs)+Pm(1,3,0,2,2,3|probs)+
Pm(1,3,0,1,3,3|probs)+Pm(1,3,0,1,2,4|probs)
For P(12) you have to distribute 2 more throws. Note, that there are combinations of throws which are impossible to get, like Pm(2,3,0,2,2,3|probs), because you have to stop earlier
And so on and so forth
Your process can be described as a Markov chain with a finite number of states, and an absorbing state.
The number of steps before reaching the absorbing state is called the hitting time. The expected hitting time can be calculated easily from the transition matrix of the Markov chain.
Enumerate all possible states (a, b, c, d, e, f). Only a finite number of states needs to be considered, because "b >= 3" is effectively the same as "b = 3", etc. The total number of states is (1+1)*(3+1)*(0+1)*(1+1)*(2+1)*(3+1) = 192.
Make sure that in your enumeration, starting state (0, 0, 0, 0, 0, 0) comes first, with index 0, and absorbing state (1, 3, 0, 1, 2, 3) comes last.
Build the transition matrix P. It's a square matrix with one row and column per state. Entry P[i, j] in the matrix gives the probability of going from state i to state j when rolling a die. There should be at most 6 non-zero entries per row.
For example, if i is the index of state (1, 0, 0, 1, 2, 2) and j the index of state (1, 1, 0, 1, 2, 2), then P[i, j] = probability of rolling face B = 0.05. Another example: if i is the index of state (1,3,0,0,0,0), then P[i,i] = probability of rolling A, B or C = 0.25+0.05+0.2 = 0.5.
Call Q the square matrix obtained by removing the last row and last column of P.
Call I the identity matrix of the same dimensions as Q.
Compute matrix M = (I - Q)^-1, where ^-1 is matrix inversion.
In matrix M, the entry M[i, j] is the expected number of times that state j will be reached before the absorbing state, when starting from state i.
Since our experiment starts in state 0, we're particularly interested in row 0 of matrix M.
The sum of row 0 of matrix M is the expected total number of visits to transient states when starting from state 0. That is exactly the answer we seek: the expected number of steps to reach the absorbing state.
To understand why this works, you should read a course on Markov chains! Perhaps this one: James Norris' course notes on Markov chains. The chapter about "hitting times" (which is the name for the number of steps before reaching target state) is chapter 1.3.
Below, an implementation in python.
from itertools import product, accumulate
from operator import mul
from math import prod
import numpy as np

dice_weights = [0.25, 0.05, 0.2, 0.15, 0.2, 0.15]
targets = [1, 3, 0, 1, 2, 3]

def get_expected_n_trials(targets, dice_weights):
    states = list(product(*(range(n+1) for n in targets)))
    base = list(accumulate([n+1 for n in targets[:0:-1]], mul, initial=1))[::-1]
    lookup = dict(map(reversed, enumerate(states)))
    P = np.zeros((len(states), len(states)))
    for i, s in enumerate(states):
        for f, p in enumerate(dice_weights):
            # j = index of state reached from state i when rolling face f
            j = i + base[f] * (s[f] < targets[f])
            j1 = lookup[s[:f] + (min(s[f]+1, targets[f]),) + s[f+1:]]
            if j != j1:
                print(i, s, f, ' --> ', j, j1)
            assert j == j1
            P[i, j] += p
    Q = P[:-1, :-1]
    I = np.identity(len(states)-1)
    M = np.linalg.inv(I - Q)
    return M[0, :].sum()

print(get_expected_n_trials(targets, dice_weights))
# 61.28361802372382
Explanations of code:
First we build the list of states using Cartesian product itertools.product
For a given state i and die face f, we need to calculate j = state reached from i when adding f. I have two ways of calculating that, either as j = i + base[f] * (s[f] < targets[f]) or as j = lookup[s[:f] + (min(s[f]+1, targets[f]),) + s[f+1:]]. Because I'm paranoid, I calculated it both ways and checked that the two ways gave the same result. But you only need one way. You can remove lines j1 = ... to assert(j == j1) if you want.
Matrix P begins filled with zeroes, and we fill up to six cells per row with P[i, j] += p where p is probability of rolling face f.
Then we compute matrices Q and M as I indicated above.
We return the sum of all the cells on the first row of M.
To help you better understand what is going on, I encourage you to examine the values of all variables. For instance you could replace return M[0, :].sum() with return states, base, lookup, P, Q, I, M and then write states, base, lookup, P, Q, I, M = get_expected_n_trials(targets, dice_weights) in the python interactive shell, so that you can look at the variables individually.
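For instance, a quick look at base (a sketch reusing the targets above) shows how the mixed-radix state indexing works:

from itertools import accumulate
from operator import mul

targets = [1, 3, 0, 1, 2, 3]
base = list(accumulate([n + 1 for n in targets[:0:-1]], mul, initial=1))[::-1]
print(base)  # [96, 24, 24, 12, 4, 1]

# each state (a, b, c, d, e, f) maps to the index sum(count * weight);
# the absorbing state (1, 3, 0, 1, 2, 3) gets the last index, 191 = 192 - 1
print(sum(t * b for t, b in zip(targets, base)))  # 191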
A Monte-Carlo simulation:
Actually roll the die until we hit the requirements;
Count how many rolls we did;
Repeat experiment 1000 times to get the empirical average value.
Implementation in python:
from collections import Counter
from random import choices
from itertools import accumulate
from statistics import mean, stdev

dice_weights = [0.25, 0.05, 0.2, 0.15, 0.2, 0.15]
targets = [1, 3, 0, 1, 2, 3]

def avg_n_trials(targets, dice_weights, n_experiments=1000):
    dice_faces = range(len(targets))
    target_state = Counter(dict(enumerate(targets)))
    cum_weights = list(accumulate(dice_weights))
    results = []
    for _ in range(n_experiments):
        state = Counter()
        while not state >= target_state:
            f = choices(dice_faces, cum_weights=cum_weights)[0]
            state[f] += 1
        results.append(state.total())  # python < 3.10: sum(state.values())
    m = mean(results)
    s = stdev(results, xbar=m)
    return m, s

m, s = avg_n_trials(targets, dice_weights, n_experiments=10000)
print(m)
# 61.4044

What's wrong with my Euclidean Distance Calculation? (Julia)

I'm trying to compute the Perceptually Important Points by using three different methods.
Euclidean Distance;
Perpendicular Distance;
Vertical Distance.
Methods 2 and 3 give me the same point, but the Euclidean distance does not. I can't find the mistake I made. I hope someone can help me.
pt = 7.6    # pt
_t = 1      # t
ptT = 10.7  # p(t+T)
_T = 253    # t+T

# Distances
dE = Float64[]  # Euclidean distances
dP = Float64[]  # Perpendicular distances
dV = Float64[]  # Vertical distances
xi = Float64[]  # x values

for i in 2:length(stockdf[:Price])-1
    _de = sqrt((_t - i)^2 + (pt - stockdf[:Price][i])^2) + sqrt((_T - i)^2 + (ptT - stockdf[:Price][i])^2)
    push!(dE, _de)
    _dP = abs(_s*i + _c - stockdf[:Price][i]) / sqrt(_s^2 + 1)
    push!(dP, _dP)
    _dV = abs(_s*i + _c - stockdf[:Price][i])
    push!(dV, _dV)
    push!(xi, i)
end
Both methods 2 and 3 give me the max point at index 153, but method 1 gives me a point that is not the max point, at index 230.
The formula for the 3rd PIP with Euclidean distance is:
dE = sqrt((t - i)^2 + (p(t) - p(i))^2) + sqrt((t+T - i)^2 + (p(t+T) - p(i))^2)
EDIT:
For a better understanding I reproduced the code with other variables which you can test for yourself.
xs = Array(1:10)
ys = rand(1:1:10, 10)

dde = Float64[]
ddP = Float64[]
ddV = Float64[]
xxi = Float64[]

# Connecting line of the first 2 PIPs
_ss = (ys[end] - ys[1]) / 10
_cc = ys[1] - (1 * (ys[end] - ys[1])) / 10

_zz = Float64[]
for i in 1:length(dedf[:Price])
    push!(_zz, _ss*i + _cc)
end

for i in 2:length(xs)-1
    _dde = sqrt((1-i)^2 + (ys[1] - ys[i])) + sqrt((10-i)^2 + (ys[end] - ys[i])^2)
    push!(dde, _dde)
    _ddP = abs(_ss*i + _cc - ys[i]) / sqrt(_ss^2 + 1)
    push!(ddP, _ddP)
    _ddV = abs(_ss*i + _cc - ys[i])
    push!(ddV, _ddV)
    push!(xxi, i)
end

println(dde)
for i in 1:length(dde)
    if ddV[i] == maximum(ddV)
        println(i)
    end
end
For the Euclidean distance I get index 7; for the perpendicular and vertical distances I get index 5. Look at the graphs:
Euclidean distance on graph
Perpendicular distance on graph
EDIT:
I'm working through a book about pattern recognition in financial time series. I downloaded the same data the book uses, and now the results are the same: all three methods give me the same index. But with different data sets, method 1 differs from methods 2 and 3, and I don't know why.

Find nearest 3D point

I have two data files, each containing a large number of 3-dimensional points (file A stores approximately 50,000 points, file B stores approximately 500,000 points). My goal is to find, for every point (a) in file A, the point (b) in file B that has the smallest distance to (a). I store the points in two lists like this:
List A nodes:
(ID X Y Z)
[ ['478277', -107.0, 190.5674, 128.1634],
['478279', -107.0, 190.5674, 134.0172],
['478282', -107.0, 190.5674, 131.0903],
['478283', -107.0, 191.9798, 124.6807],
... ]
List B data:
(X Y Z Data)
[ [-28.102, 173.657, 229.744, 14.318],
[-28.265, 175.549, 227.824, 13.648],
[-27.695, 175.925, 227.133, 13.142],
...]
My first approach was to simply iterate through the first and second list with a nested loop and compute the distance between every points like this:
outfile = open(job[0] + '/' + output, 'wb');
dist_min = float(job[5]);
dist_max = float(job[6]);
dists = [];
for node in nodes:
    shortest_distance = 1000.0;
    shortest_data = 0.0;
    for entry in data:
        dist = math.sqrt((node[1] - entry[0])**2 + (node[2] - entry[1])**2 + (node[3] - entry[2])**2);
        if (dist_min <= dist <= dist_max) and (dist < shortest_distance):
            shortest_distance = dist;
            shortest_data = entry[3];
    outfile.write(node[0] + ', ' + str('%10.5f' % shortest_data + '\n'));
outfile.close();
I recognized that the number of loop iterations Python has to run is way too big (~25,000,000,000), so I had to speed up my code. I tried to first calculate all distances with list comprehensions, but the code is still too slow:
p_x = [row[1] for row in nodes];
p_y = [row[2] for row in nodes];
p_z = [row[3] for row in nodes];
q_x = [row[0] for row in data];
q_y = [row[1] for row in data];
q_z = [row[2] for row in data];
dx = [[(px - qx) for px in p_x] for qx in q_x];
dy = [[(py - qy) for py in p_y] for qy in q_y];
dz = [[(pz - qz) for pz in p_z] for qz in q_z];
dx = [[dxxx * dxxx for dxxx in dxx] for dxx in dx];
dy = [[dyyy * dyyy for dyyy in dyy] for dyy in dy];
dz = [[dzzz * dzzz for dzzz in dzz] for dzz in dz];
D = [[(dx[i][j] + dy[i][j] + dz[i][j]) for j in range(len(dx[0]))] for i in range(len(dx))];
D = [[(DDD**(0.5)) for DDD in DD] for DD in D];
To be honest, at this point I do not know which of the two approaches is better; in any case, neither possibility seems feasible. I'm not even sure whether it is possible to write code that calculates all the distances in an acceptable time. Is there another way to solve my problem without calculating all distances?
Edit: I forgot to mention that I am running on Python 2.5.1 and am not allowed to install or add any new libraries...
Just in case someone is interested in the solution, I found a way to speed up the whole process by not calculating all distances:
I created a 3D list representing a grid over the given 3D space, divided in X, Y and Z by a given step size (e.g. (max - min) / 1,000). Then I iterated over every 3D point to put it into my grid. After that I iterated over the points of set A again, checking whether there are points from B in the same cube; if not, the search radius is increased so the process looks in the adjacent 26 cubes for points. The radius keeps increasing until at least one point is found. The resulting list is comparatively small, can be sorted quickly, and the nearest point is found.
The processing time went down to a couple of minutes and it is working fine.
p_x = [row[1] for row in nodes];
p_y = [row[2] for row in nodes];
p_z = [row[3] for row in nodes];
q_x = [row[0] for row in data];
q_y = [row[1] for row in data];
q_z = [row[2] for row in data];
min_x = min(p_x + q_x);
min_y = min(p_y + q_y);
min_z = min(p_z + q_z);
max_x = max(p_x + q_x);
max_y = max(p_y + q_y);
max_z = max(p_z + q_z);
max_n = max(max_x, max_y, max_z);
min_n = min(min_x, min_y, min_z);
gridcount = 1000;
step = (max_n - min_n) / gridcount;
ruler_x = [min_x + (i * step) for i in range(gridcount + 1)];
ruler_y = [min_y + (i * step) for i in range(gridcount + 1)];
ruler_z = [min_z + (i * step) for i in range(gridcount + 1)];
grid = [[[0 for i in range(gridcount)] for j in range(gridcount)] for k in range(gridcount)];
for node in nodes:
    loc_x = self.abatemp_get_cell(node[1], ruler_x);
    loc_y = self.abatemp_get_cell(node[2], ruler_y);
    loc_z = self.abatemp_get_cell(node[3], ruler_z);
    if grid[loc_x][loc_y][loc_z] == 0:
        grid[loc_x][loc_y][loc_z] = [[node[1], node[2], node[3], node[0]]];
    else:
        grid[loc_x][loc_y][loc_z].append([node[1], node[2], node[3], node[0]]);
for entry in data:
    loc_x = self.abatemp_get_cell(entry[0], ruler_x);
    loc_y = self.abatemp_get_cell(entry[1], ruler_y);
    loc_z = self.abatemp_get_cell(entry[2], ruler_z);
    if grid[loc_x][loc_y][loc_z] == 0:
        grid[loc_x][loc_y][loc_z] = [[entry[0], entry[1], entry[2], entry[3]]];
    else:
        grid[loc_x][loc_y][loc_z].append([entry[0], entry[1], entry[2], entry[3]]);
out = [];
outfile = open(job[0] + '/' + output, 'wb');
for node in nodes:
    neighbours = [];
    radius = -1;
    loc_nx = self.abatemp_get_cell(node[1], ruler_x);
    loc_ny = self.abatemp_get_cell(node[2], ruler_y);
    loc_nz = self.abatemp_get_cell(node[3], ruler_z);
    reloop = True;
    while reloop:
        if neighbours:
            reloop = False;
        radius += 1;
        start_x = 0 if ((loc_nx - radius) < 0) else (loc_nx - radius);
        start_y = 0 if ((loc_ny - radius) < 0) else (loc_ny - radius);
        start_z = 0 if ((loc_nz - radius) < 0) else (loc_nz - radius);
        end_x = (len(ruler_x) - 1) if ((loc_nx + radius + 1) > (len(ruler_x) - 1)) else (loc_nx + radius + 1);
        end_y = (len(ruler_y) - 1) if ((loc_ny + radius + 1) > (len(ruler_y) - 1)) else (loc_ny + radius + 1);
        end_z = (len(ruler_z) - 1) if ((loc_nz + radius + 1) > (len(ruler_z) - 1)) else (loc_nz + radius + 1);
        for i in range(start_x, end_x):
            for j in range(start_y, end_y):
                for k in range(start_z, end_z):
                    if grid[i][j][k] != 0:
                        for grid_entry in grid[i][j][k]:
                            if not isinstance(grid_entry[3], basestring):
                                neighbours.append(grid_entry);
    dists = [];
    for n in neighbours:
        d = math.sqrt((node[1] - n[0])**2 + (node[2] - n[1])**2 + (node[3] - n[2])**2);
        dists.append([d, n[3]]);
    dists = sorted(dists);
    outfile.write(node[0] + ', ' + str(dists[0][-1]) + '\n');
outfile.close();
Function to get the position of a point:
def abatemp_get_cell(self, n, ruler):
    for i in range(len(ruler) - 1):
        if ruler[i] <= n <= ruler[i + 1]:
            return i;
    return False;
The gridcount variable lets one tune the process: with a small gridcount, sorting the points into the grid is very fast, but the neighbour lists in the search loop get bigger and more time is needed for that part. With a big gridcount, more time is needed at the beginning, but the search loop runs faster.
The only issue I have now is that there are cases where the process finds neighbours while other points, not yet found, are actually closer to the query point (see picture). So far I have mitigated this by incrementing the search radius one more time once neighbours have been found. Even then there are points that are closer but not in the neighbours list, although it is a very small amount (92 out of ~100,000). I could solve this by incrementing the radius twice after finding neighbours, but that solution does not seem very smart. Maybe you guys have an idea...
This is the first working draft of the process; I think it will be possible to improve it even more, but it should give you an idea of how it works.
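Just to illustrate the same bucketing idea more compactly (a sketch with hypothetical helper names, standard library only, so it still respects the no-extra-libraries constraint), the grid can also be kept as a dict keyed by integer cell coordinates:

import math

def build_grid(points, step):
    # points from set B as (x, y, z, data); key = integer cell coordinates
    grid = {}
    for p in points:
        key = (int(p[0] // step), int(p[1] // step), int(p[2] // step))
        grid.setdefault(key, []).append(p)
    return grid

def nearest_data(node, grid, step):
    # node from set A as (id, x, y, z)
    cx, cy, cz = int(node[1] // step), int(node[2] // step), int(node[3] // step)

    def collect(radius):
        found = []
        for i in range(cx - radius, cx + radius + 1):
            for j in range(cy - radius, cy + radius + 1):
                for k in range(cz - radius, cz + radius + 1):
                    found.extend(grid.get((i, j, k), []))
        return found

    radius = 0
    while not collect(radius):
        radius += 1
    # scan one extra ring: a point just outside the first non-empty shell can
    # still be closer (same caveat as above; for an exact answer keep growing
    # until the best distance found is <= radius * step)
    candidates = collect(radius + 1)
    best = min(candidates, key=lambda p: (node[1] - p[0])**2
                                         + (node[2] - p[1])**2
                                         + (node[3] - p[2])**2)
    return best[3], math.sqrt((node[1] - best[0])**2
                              + (node[2] - best[1])**2
                              + (node[3] - best[2])**2)

A dict only stores non-empty cells, which avoids allocating the full gridcount^3 list of lists.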
It took me a bit of thought, but in the end I think I found a solution for you.
Your problem is not in the code you wrote but in the algorithm it implements.
There is an algorithm called Dijkstra's algorithm; here is the gist of it: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm .
Now what you need to do is use this algorithm in a clever way:
Create a node S (standing for source).
Now link edges from S to all the nodes in group B.
After you have done that, link edges from each point b in B to each point a in A.
Set the cost of the links from the source to 0 and the cost of the others to the distance between the two points (in 3D).
Now if we run Dijkstra's algorithm, the output we get is the cost to travel from S to each point in the graph (we are only interested in the distances to points in group A).
Since the cost from S to each point b in B is 0, and S is only connected to points in B, the path to any point a in A must include exactly one node from B (exactly one, since the shortest distance between two points is a single straight segment).
I am not sure whether this will speed up your code, but as far as I know there is no way to solve this problem without calculating all the distances, and this algorithm has the best time complexity one could hope for.
Take a look at this generic 3D data structure:
https://github.com/m4nh/skimap_ros
It has a very fast RadiusSearch feature ready to be used. This solution (similar to an octree, but faster) saves you from having to create the regular grid first (you don't have to fix MAX/MIN sizes along each axis) and it saves a lot of memory.

Minimum Weight Triangulation Taking Forever

So I've been working on a program in Python that finds the minimum weight triangulation of a convex polygon. This means it finds the weight (the sum of all the triangle perimeters) as well as the list of chords (lines going through the polygon that break it up into triangles, not the boundaries).
I was under the impression that I'm using the dynamic programming algorithm; however, when I try a somewhat more complex polygon it takes forever (I'm not sure how long, because I haven't gotten it to finish).
It works fine with a 10-sided polygon, but I'm trying a 25-sided one and that's what is making it stall. My teacher gave me the polygons, so I assume the 25-sided one is supposed to work as well.
Since this algorithm is supposed to be O(n^3), the 25-sided polygon should take roughly 15.625 times longer to calculate, yet it's taking way longer, seeing that the 10-sided one seems instantaneous.
Am I doing some sort of extra O(n) work in there that I'm not realizing? I can't see anything, except maybe the last part where I get rid of the duplicates by turning the list into a set; however, I put a trace after the decomp call, before the conversion happens, and it's not even reaching that point.
Here's my code; if you need any more info, just ask. Something in there is making it take longer than O(n^3) and I need to find it so I can trim it out.
#!/usr/bin/python
import math

def cost(v):
    ab = math.sqrt(((v[0][0] - v[1][0])**2) + ((v[0][1] - v[1][1])**2))
    bc = math.sqrt(((v[1][0] - v[2][0])**2) + ((v[1][1] - v[2][1])**2))
    ac = math.sqrt(((v[0][0] - v[2][0])**2) + ((v[0][1] - v[2][1])**2))
    return ab + bc + ac

def triang_to_chord(t, n):
    if t[1] == t[0] + 1:
        # a and b
        if t[2] == t[1] + 1:
            # single
            # b and c
            return ((t[0], t[2]), )
        elif t[2] == n-1 and t[0] == 0:
            # single
            # c and a
            return ((t[1], t[2]), )
        else:
            # double
            return ((t[0], t[2]), (t[1], t[2]))
    elif t[2] == t[1] + 1:
        # b and c
        if t[0] == 0 and t[2] == n-1:
            # single
            # c and a
            return ((t[0], t[1]), )
        else:
            # double
            return ((t[0], t[1]), (t[0], t[2]))
    elif t[0] == 0 and t[2] == n-1:
        # c and a
        # double
        return ((t[0], t[1]), (t[1], t[2]))
    else:
        # triple
        return ((t[0], t[1]), (t[1], t[2]), (t[0], t[2]))

file_name = raw_input("Enter the polygon file name: ").rstrip()
file_obj = open(file_name)
vertices_raw = file_obj.read().split()
file_obj.close()

vertices = []
for i in range(len(vertices_raw)):
    if i % 2 == 0:
        vertices.append((float(vertices_raw[i]), float(vertices_raw[i+1])))

n = len(vertices)

def decomp(i, j):
    if j <= i: return (0, [])
    elif j == i+1: return (0, [])
    cheap_chord = [float("infinity"), []]
    old_cost = cheap_chord[0]
    smallest_k = None
    for k in range(i+1, j):
        old_cost = cheap_chord[0]
        itok = decomp(i, k)
        ktoj = decomp(k, j)
        cheap_chord[0] = min(cheap_chord[0], cost((vertices[i], vertices[j], vertices[k])) + itok[0] + ktoj[0])
        if cheap_chord[0] < old_cost:
            smallest_k = k
            cheap_chord[1] = itok[1] + ktoj[1]
    temp_chords = triang_to_chord(sorted((i, j, smallest_k)), n)
    for c in temp_chords:
        cheap_chord[1].append(c)
    return cheap_chord

results = decomp(0, len(vertices) - 1)
chords = set(results[1])
print "Minimum sum of triangle perimeters = ", results[0]
print len(chords), "chords are:"
for c in chords:
    print "  ", c[0], " ", c[1]
I'll add the polygons I'm using, again the first one is solved right away, while the second one has been running for about 10 minutes so far.
FIRST ONE:
202.1177 93.5606
177.3577 159.5286
138.2164 194.8717
73.9028 189.3758
17.8465 165.4303
2.4919 92.5714
21.9581 45.3453
72.9884 3.1700
133.3893 -0.3667
184.0190 38.2951
SECOND ONE:
397.2494 204.0564
399.0927 245.7974
375.8121 295.3134
340.3170 338.5171
313.5651 369.6730
260.6411 384.6494
208.5188 398.7632
163.0483 394.1319
119.2140 387.0723
76.2607 352.6056
39.8635 319.8147
8.0842 273.5640
-1.4554 226.3238
8.6748 173.7644
20.8444 124.1080
34.3564 87.0327
72.7005 46.8978
117.8008 12.5129
162.9027 5.9481
210.7204 2.7835
266.0091 10.9997
309.2761 27.5857
351.2311 61.9199
377.3673 108.9847
390.0396 148.6748
It looks like you have an issue with inefficient recursion here.
...
def decomp(i, j):
    ...
    for k in range(i+1, j):
        ...
        itok = decomp(i, k)
        ktoj = decomp(k, j)
        ...
...
You've run into the same kind of issue as a naive recursive implementation of the Fibonacci numbers, but the way this algorithm works, it will probably be much worse on runtime. Assuming that is the only issue with your algorithm, you just need to use memoization to ensure that decomp is only calculated once for each unique input.
The way to spot this issue is to print out the values of i, j and k as the triple (i, j, k). In order to obtain a runtime of O(N^3), you shouldn't see the same exact triple twice. However, the triple (22, 24, 23) appears at least twice (in the 25-sided polygon), and it is the first such duplicate. That shows the algorithm is calculating the same thing multiple times, which is inefficient and pushes the runtime well past O(N^3). I'll leave figuring out the algorithm's actual performance to you as an exercise. Assuming there isn't something else wrong with the algorithm, it should eventually stop.
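To make the fix concrete, here is a minimal memoized version of the cost computation only (a sketch reusing cost and vertices from the question; the chord bookkeeping is omitted for brevity). Caching results by (i, j) means each of the O(n^2) subproblems is solved once, so the inner k loop gives O(n^3) total work:

memo = {}

def decomp_cost(i, j):
    # weight of the cheapest triangulation of the sub-polygon i..j
    if j <= i + 1:
        return 0
    if (i, j) in memo:
        return memo[(i, j)]
    best = float("infinity")
    for k in range(i + 1, j):
        best = min(best, cost((vertices[i], vertices[j], vertices[k]))
                         + decomp_cost(i, k) + decomp_cost(k, j))
    memo[(i, j)] = best
    return best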
