I have a square matrix in R containing distances between cities:
set.seed(3)
x <- matrix(sample(1:15, 16, replace = TRUE), nrow = 4)
x
#     [,1] [,2] [,3] [,4]
#[1,]    3   10    9    9
#[2,]   13   10   10    9
#[3,]    6    2    8   14
#[4,]    5    5    8   13
Every row represents a city from which a courier can be sent, and every column represents a city where a package has to be delivered. All couriers have the same package, so they can be assigned to every other city.
Normally I would use the Hungarian algorithm in clue::solve_LSAP() to find an optimal assignment so that the total cost (in this case total distance) would be minimized:
y <- clue::solve_LSAP(x)
y
#Optimal assignment:
#1 => 1, 2 => 4, 3 => 2, 4 => 3
However, in this specific case I would like to minimize the spread of the distances.
I have been searching for quite some time now, and I found the following in this book at page 270 (and something similar in this book at page 195):
So the stated objective is to minimize the difference between the maximum and minimum assigned distance, which is exactly what I am looking for. The assignment of the Hungarian algorithm gives the following difference between maximum and minimum distance:
distances <- x[cbind(1:4, y)]
max(distances) - min(distances)
#[1] 7
However, the best assignment to minimize the new objective is (this solution was found by brute force):
#Optimal assignment to minimize the new objective:
#1 => 2, 2 => 4, 3 => 1, 4 => 3
yNew <- c(2, 4, 1, 3)
distancesNew <- x[cbind(1:4, yNew)]
max(distancesNew) - min(distancesNew)
#[1] 4
So clearly, the Hungarian algorithm doesn't give the desired solution.
Now my question: is there any existing R code that finds an optimal assignment by minimizing the objective mentioned above (the difference between the maximum and minimum assigned cost value)? Or maybe some R code with an algorithm that achieves a similar result?
Those books I mentioned describe the algorithm I want, but 1) they both start with the solution of the bottleneck assignment problem (an algorithm for which I couldn't find R code either) and 2) implementing this algorithm (and possibly also the bottleneck assignment algorithm) myself would be far from ideal.
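For reference, the brute-force search that produced the assignment above can be sketched in base R (a sketch only; `perms` and `spread_assign` are names I made up, and enumerating all n! permutations is feasible only for small matrices):

```r
# Matrix from the question (values as shown in the printout above)
x <- matrix(c(3, 13, 6, 5,
              10, 10, 2, 5,
              9, 10, 8, 8,
              9, 9, 14, 13), nrow = 4)

# All permutations of a vector (n! of them -- small n only)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Pick the assignment minimizing max(distance) - min(distance)
spread_assign <- function(x) {
  n <- nrow(x)
  best <- NULL
  best_spread <- Inf
  for (p in perms(seq_len(n))) {
    d <- x[cbind(seq_len(n), p)]
    s <- max(d) - min(d)
    if (s < best_spread) {
      best_spread <- s
      best <- p
    }
  }
  list(assignment = best, spread = best_spread)
}

spread_assign(x)$spread
# [1] 4
```

The spread of 4 matches the brute-force optimum quoted above; ties are possible, so the returned permutation need not be exactly `c(2, 4, 1, 3)`.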
Related
I have an N by T matrix of data (N >> T) which has many missing (NA) values. I am searching for the best row subset of my matrix, such that any two rows N_i and N_j have at least k >= 2 co-occurring (i.e. occurring at the same time t) non-NA values. To motivate this: my end goal in finding this subset is to be able to compute a pairwise correlation matrix (using R's cor(., use = "pairwise.complete.obs")) as an initial estimator for a shrinkage procedure.
Formally, if X is my raw matrix, let Y be the matrix where each entry is 1 if the corresponding X entry has an observed value and 0 for NA. Then YY' is an N x N matrix counting co-occurrences of non-NA values. Initially this YY' matrix will contain 0 values, indicating that rows N_i and N_j have 0 co-occurring values. I want to find the largest subset of X such that the minimum of YY' is, say, k = 2, or 3, etc.
Is there any algorithm to solve this question, with ideally an implementation in R? I have a feeling this YY' matrix can be considered like a graph, and that there might exist known procedures for this question (clique finding? Steiner tree?) but I am not knowledgeable enough in this domain... I can think of a greedy iterative algorithm removing the worst row each time and re-computing YY, but was hoping there would be a more elegant and faster solution?
Quick simulation of the data:
set.seed(123)
X <- matrix(rnorm(200*20), 200, 20)
X[sample(1:(200*20), 200*20/2)] <- NA
Y <- 1*!is.na(X)
YY <- tcrossprod(Y)
YY[1:3,1:3]
#>      [,1] [,2] [,3]
#> [1,]    9    6    3
#> [2,]    6   10    4
#> [3,]    3    4   11
min(YY)
#> [1] 0
Created on 2020-09-28 by the reprex package (v0.3.0)
Unfortunately this problem is NP-hard. Given a graph with n vertices and m edges, we can reduce the problem of finding a max clique to solving this problem with k=1 on the vertex-edge incidence matrix (duplicate columns for higher k).
I'd turn the reduction around in the obvious way and reduce this problem to max clique.
The vertices of the graph correspond to rows. There is an edge between two vertices iff the corresponding rows have enough co-occurring non-NA values. To find the max clique, you can call out to the igraph R package.
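The reduction described above can be run concretely with igraph (a sketch; `best_rows` is a helper name I made up, and max-clique search is exponential in the worst case, though often fast in practice):

```r
library(igraph)

# Largest row subset in which every pair shares at least k observations,
# via the max-clique reduction described above.
best_rows <- function(YY, k) {
  adj <- 1 * (YY >= k)       # rows i, j compatible iff YY[i, j] >= k
  diag(adj) <- 0             # no self-loops
  g <- graph_from_adjacency_matrix(adj, mode = "undirected")
  sort(as.integer(largest_cliques(g)[[1]]))  # one maximum clique
}

# Tiny toy example: rows 1-3 pairwise share >= 2 observations, row 4 does not.
YYtoy <- matrix(c(5, 3, 2, 1,
                  3, 6, 4, 0,
                  2, 4, 7, 1,
                  1, 0, 1, 8), 4, 4, byrow = TRUE)
best_rows(YYtoy, k = 2)
# [1] 1 2 3
```

The same call works on the simulated YY from the question, though on a near-complete 200-vertex graph the clique enumeration can take noticeably longer.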
I have some trouble with a linear program that I already solved in Excel, but now I want to do it in R/Python because I have reached the limits of Excel and its solver. Therefore I am asking for help on this specific topic.
I tried it with the lpSolve package, also by altering the lp.assign function, but I cannot come up with a solution.
The problem is as follows:
Let's say I am a deliverer of a commodity good.
I have different depots which serve different areas. These areas MUST be served with their demands.
My depots, on the other hand, have a constraint on the capacity they can handle and deliver.
One depot can serve several areas, but one area can only be served by one depot.
I have the distance/cost matrix for the connections between depots and areas, as well as the demand for those areas.
The objective is to serve the areas with the minimal possible effort.
Let's say the cost/distance matrix looks something like this:
assign.costs <- matrix(c(2, 7, 7, 2, 7, 7, 3, 2, 7, 2, 8, 10, 1, 9, 8, 2, 7, 8, 9, 10), 4, 10)
So this creates my matrix, with the customers/areas as the columns/header and the depots as the rows/row names.
Now the demand of the areas/customers is:
assign.demand <- matrix(c(1,2,3,4,5,6,7,8,9,10), 1, 10)
The capacity restriction, i.e. the amount the depots are able to serve, is:
assign.capacity <- matrix(c(15,15,15,15), 4, 1)
So now I would like this problem to be solved by an LP, to generate the allocation of which area is served by which depot according to these restrictions.
The result should look something like this:
assign.solution <- matrix(c(1,0,0,0, 0,1,0,0, 1,0,0,0, 1,0,0,0, 0,0,0,1), 4, 10)
As for the restrictions, this means that every column must sum to one.
I tried it with the lp and lp.assign functions from lpSolve, but I don't know exactly how to implement this specific kind of restriction, and I have already tried to alter the lp.assign function without success.
If it helps, I can also formulate the equations for the LP.
Thank you all for your help, I am really stuck right now :D
BR
Step 1. Develop a mathematical model
The mathematical model can look like:
The blue entries represent data and the red ones indicate decision variables. Here i indexes the depots and j the customers. Ship indicates whether we ship from i to j (it is a binary variable). The first constraint says that the total amount shipped from depot i must not exceed its capacity. The second constraint says that there must be exactly one supplier i for each customer j.
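Reconstructing the model in symbols (my transcription of the picture referenced above, not the original image):

```latex
\begin{aligned}
\min \quad & \sum_{i,j} \text{cost}_{i,j}\,\text{ship}_{i,j} \\
\text{s.t.} \quad & \sum_{j} \text{demand}_{j}\,\text{ship}_{i,j} \le \text{capacity}_{i} && \forall i \\
& \sum_{i} \text{ship}_{i,j} = 1 && \forall j \\
& \text{ship}_{i,j} \in \{0,1\}
\end{aligned}
```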
Step 2. Implementation
This is now just a question of being precise. I follow the model from the previous section as closely as I can.
library(dplyr)
library(tidyr)
library(ROI)
library(ROI.plugin.symphony)
library(ompr)
library(ompr.roi)
num_depots <- 4
num_cust <- 10
cost <- matrix(c(2, 7, 7, 2, 7, 7, 3, 2, 7, 2, 8, 10, 1, 9, 8, 2,7,8,9,10), num_depots, num_cust)
demand <- c(1,2,3,4,5,6,7,8,9,10)
capacity <- c(15,15,15,15)
m <- MIPModel() %>%
  add_variable(ship[i, j], i = 1:num_depots, j = 1:num_cust, type = "binary") %>%
  add_constraint(sum_expr(demand[j] * ship[i, j], j = 1:num_cust) <= capacity[i], i = 1:num_depots) %>%
  add_constraint(sum_expr(ship[i, j], i = 1:num_depots) == 1, j = 1:num_cust) %>%
  set_objective(sum_expr(cost[i, j] * ship[i, j], i = 1:num_depots, j = 1:num_cust), "min") %>%
  solve_model(with_ROI(solver = "symphony", verbosity = 1))
cat("Status:", solver_status(m), "\n")
cat("Objective:", objective_value(m), "\n")
get_solution(m, ship[i, j]) %>%
  filter(value > 0)
We see how important it is to first write down a mathematical model. It is much more compact and easier to reason about than a bunch of code. Going directly to code often leads to all kinds of problems; it is like building a house without a blueprint. Even for this small example, writing down the mathematical model is a useful exercise.
For the implementation I used OMPR instead of the lpSolve package because OMPR allows me to stay closer to the mathematical model. lpSolve has a matrix interface, which is very difficult to use except for very structured models.
Step 3: Solve it
Status: optimal
Objective: 32
   variable i  j value
1      ship 1  1     1
2      ship 4  2     1
3      ship 2  3     1
4      ship 1  4     1
5      ship 3  5     1
6      ship 4  6     1
7      ship 4  7     1
8      ship 2  8     1
9      ship 1  9     1
10     ship 3 10     1
I believe this is the correct solution.
I inherited the code of a co-worker and now I'm trying to solve some issues that arose in the last days (unfortunately I don't have time to doctor ALL the code, so I'm approaching each bug as pragmatically as possible). In particular, a bug forces the program to work with at least two parameters (when one should be enough).
var_pre <- vapply(1:repetitions, function (i) {...}, numeric(parameters))
This vapply seems to be the source of the bug. The function in the vapply produces n atomic vectors of length = parameters (where n = repetitions). The vapply will produce an atomic vector of length = repetitions if parameters = 1.
For instance, if parameters = 1 and repetitions = 3, the function will produce [1] 1, [1] 2, [1] 3. And var_pre will store
[1] 1 2 3.
However, if parameters > 1, the output will be a matrix with as many rows as parameters and number of columns equal to repetitions.
For instance, if parameters = 2 and repetitions = 3, the function will produce [1] 1 2, [1] 3 4, [1] 5 6. And var_pre will store a matrix
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
Since var_pre is further processed in the code, this issue destroys the overall workflow of the program. I want to solve this problem in the cleanest way possible, without messing with the function. I have to understand why vapply produces results of different classes.
Is there an explanation for this?
EDIT: The program uses the matrix form later. Therefore, when parameters = 1 and repetitions = 3, I need a matrix like
[,1] [,2] [,3]
[1,] 1 2 3
I am solving a constrained optimization problem in R iteratively, where the problem is of the following form:
min f(x)
s.t. g(x) < 0
I am doing the optimization iteratively, so I have a vector f that holds the value of the objective at each iteration and a vector g of 1's and 0's that tells me whether the constraint was met at the ith iteration. To help, here is an example:
#Imagine there are 6 time points
f = c(7,10,6,3,-1,-9)
g = c(1,1,0,1,1,0)
And so I want to keep track of the best (minimum) value found up to the ith iteration such that the constraint was satisfied. Thus I want to be able to calculate a vector like
h = c(7,7,7,3,-1,-1)
where h records the minimum value found so far such that the constraint was met. Is there an easy way to do this in R, with or without loops?
Given your f and g, I would do it like this:
cummin(zoo::na.locf(ifelse(g == 1, f, NA)))
# [1] 7 7 7 3 -1 -1
Or, perhaps more simply:
cummin((ifelse(g == 1, f, Inf)))
# [1] 7 7 7 3 -1 -1
I can have any number row consisting of 2 to 10 numbers, and from this row I have to get the ratio of the geometric progression.
For example:
For example, given the number row 125 5 625, I have to get the answer 5. For the row 128 8 512, I have to get the answer 4.
Can you give me a hand? I don't ask for a program, just a hint, I want to understand it by myself and write a code by myself, but damn, I have been thinking the whole day and couldn't figure this out.
Thank you.
DON'T WRITE THE WHOLE PROGRAM!
Guys, you don't get it, I can't just simply do a division. I actually have to get the geometric progression + show all its numbers. For the row 128 8 512, all the numbers would be: 8 32 128 512.
Seth's answer is the right one. I'm leaving this answer here to help elaborate on why the answer to 128 8 512 is 4 because people seem to be having trouble with that.
A geometric progression's elements can be written in the form c*b^n, where b is the number you're looking for (b is also necessarily greater than 1), c is a constant and n is some arbitrary number.
So the best bet is to start with the smallest number, factorize it and look at all possible ways of writing it in the c*b^n form, then try each candidate b on the remaining numbers. Return the largest b that works.
So for your examples:
125 5 625
Start with 5. 5 is prime, so it can be written in only one way: 5 = 1*5^1. So your b is 5. You can stop now, assuming you know the row is in fact geometric. If you need to determine whether it's geometric then test that b on the remaining numbers.
128 8 512
8 can be written in more than one way: 8 = 1*8^1, 8 = 2*2^2, 8 = 2*4^1, 8 = 4*2^1. So you have three possible values for b (8, 4 and 2), with a few different options for c. Try the biggest first: 8 doesn't work. Try 4: it works! 128 = 2*4^3 and 512 = 2*4^4. So b is 4 and c is 2.
3 15 375
This one is a bit mean because the first number is prime but isn't b, it's c. So you'll need to make sure that if your first b-candidate doesn't work on the remaining numbers you have to look at the next smallest number and decompose it. So here you'd decompose 15: 15 = 15*?^0 (degenerate case), 15 = 3*5^1, 15 = 5*3^1, 15 = 1*15^1. The answer is 5, and 3 = 3*5^0, so it works out.
Edit: I think this should be correct now.
This algorithm does not rely on factoring, only on the Euclidean Algorithm and a close variant thereof. This makes it slightly more mathematically sophisticated than a solution that uses factoring, but it will be MUCH faster. If you understand the Euclidean Algorithm and logarithms, the math should not be a problem.
(1) Sort the set of numbers. You have numbers of the form ab^{n1} < .. < ab^{nk}.
Example: (3 * 2, 3*2^5, 3*2^7, 3*2^13)
(2) Form a new list whose nth element is the (n+1)st element of the sorted list divided by the nth. You now have b^{n2 - n1}, b^{n3 - n2}, ..., b^{nk - n(k-1)}.
(Continued) Example: (2^4, 2^2, 2^6)
Define d_i = n_(i+1) - n_i (do not program this -- you couldn't even if you wanted to, since the n_i are unknown -- this is just to explain how the program works).
(Continued) Example: d_1 = 4, d_2 = 2, d_3 = 6
Note that in our example problem, we're free to take either (a = 3, b = 2) or (a = 3/2, b = 4). The bottom line is any power of the "real" b that divides all entries in the list from step (2) is a correct answer. It follows that we can raise b to any power that divides all the d_i (in this case any power that divides 4, 2, and 6). The problem is we know neither b nor the d_i. But if we let m = gcd(d_1, ... d_(k-1)), then we CAN find b^m, which is sufficient.
NOTE: Given b^i and b^j, we can find b^gcd(i, j) using:
log(b^i) / log(b^j) = (i log b) / (j log b) = i/j
This permits us to use a modified version of the Euclidean Algorithm to find b^gcd(i, j). The "action" is all in the exponents: addition has been replaced by multiplication, multiplication with exponentiation, and (consequently) quotients with logarithms:
import math

def power_remainder(a, b):
    q = int(math.log(a) / math.log(b))
    return a / (b ** q)

def power_gcd(a, b):
    while b != 1:
        a, b = b, power_remainder(a, b)
    return a
(3) Since all the elements of the original set differ by powers of r = b^gcd(d_1, ..., d_(k-1)), they are all of the form cr^n, as desired. However, c may not be an integer. Let me know if this is a problem.
The simplest approach would be to factorize the numbers and find the greatest divisor they have in common. But be careful: factorization has exponential complexity, so it might stop working if you get big numbers in the row.
What you want is to know the Greatest Common Divisor of all numbers in a row.
One method is to check if they all can be divided by the smaller number in the row.
If not, try half the smaller number in the row.
Then keep going down until you find a number that divides them all or your divisor equals 1.
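The "divides them all" check above amounts to computing a GCD, and the Euclidean algorithm does that directly, much faster than the halving search (a minimal R sketch of just the GCD step, swapping in Euclid for the method described above; it is only a hint, not the whole program):

```r
# Euclidean algorithm, applied pairwise across the row with Reduce()
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)
Reduce(gcd2, c(125, 5, 625))
# [1] 5
```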
Seth's answer is not correct: applying that solution does not solve the row 128 8 2048 (which is 2*4^x), for example. You get:
8 128 2048 =>
16 16 =>
GCD = 16
It is true that the solution is a factor of this result, but you will need to factor it and check one by one which factor is the correct answer. In this case you need to check the factors of 16 in reverse order (16, 8, 4, 2) until you see that 4 matches all the conditions.