I have the following problem. I have n data points (typically n = 1000), each an integer from {1,2,3}, so there are a lot of repeating numbers, and a real number d. I have to choose k < n points (k is given) so that the distance between the mean of those k points and d is minimized. This can be expressed as a MILP problem (please see here).
I tried to solve this in R using the lpSolve and Rglpk packages, but it takes a very long time (for n = 100 points the code has already been running for 40 minutes). I guess the issue is the large number of binary variables (n) combined with the repeated values.
library(Rglpk)
set.seed(123)
sampsize <- sample(c(1,2,3), size = 100, replace = TRUE)
k <- 50
d <- 86/47
lngth <- length(sampsize)
# variables: t (continuous deviation) followed by one binary selector per point
# objective: minimize t
f.obj <- c(1, rep(0, lngth))
# constraints: sum of selectors = k; mean - t <= d; mean + t >= d
f.con <- matrix(c(0, rep(1, lngth), -1, sampsize/k, 1, sampsize/k), nrow = 3, byrow = TRUE)
f.dir <- c("==", "<=", ">=")
f.rhs <- c(k, d, d)
f.types <- c("C", rep("B", lngth))
res <- Rglpk_solve_LP(obj = f.obj, mat = f.con, dir = f.dir, rhs = f.rhs, max = FALSE, types = f.types)
I will be satisfied with a sub-optimal solution. Is there a way to solve it quickly or re-express the problem in a certain way to speed up the algorithm?
I would appreciate any input on this.
CVXR is a much better tool for this:
#
# generate random data
#
set.seed(123)
N <- 100 # sample size
v <- c(1,2,3) # sample from these unique values
M <- length(v) # number of unique values
data <- sample(v, size=N, replace=TRUE)
tab <- table(data) # tabulate
K <- 50 # number of points to choose
target <- 86/47 # target for mean
#
# CVXR model
# see https://cvxr.rbind.io/
#
library(CVXR)
# number of values to select from each bin
select <- Variable(M, integer = TRUE)
# objective: absolute deviation of the selected mean from the target
objective <- Minimize(abs(sum(v*select)/K - target))
# cardinality, nonnegativity and availability constraints
constraints <- list(sum(select) == K, select >= 0, select <= vec(tab))
problem <- Problem(objective, constraints)
sol <- solve(problem, verbose = TRUE)
cat("\n")
cat("Status:",sol$status,"\n")
cat("Objective:",sol$value,"\n")
cat("Solution:",sol$getValue(select),"\n")
Output:
GLPK Simplex Optimizer, v4.65
9 rows, 4 columns, 17 non-zeros
0: obj = 0.000000000e+00 inf = 5.183e+01 (2)
3: obj = 5.702127660e-01 inf = 0.000e+00 (0)
* 4: obj = 1.065814104e-16 inf = 0.000e+00 (0)
OPTIMAL LP SOLUTION FOUND
GLPK Integer Optimizer, v4.65
9 rows, 4 columns, 17 non-zeros
3 integer variables, none of which are binary
Integer optimization begins...
Long-step dual simplex will be used
+ 4: mip = not found yet >= -inf (1; 0)
+ 55: >>>>> 1.021276596e-02 >= 9.787234043e-03 4.2% (52; 0)
+ 56: >>>>> 9.787234043e-03 >= 9.787234043e-03 < 0.1% (16; 36)
+ 56: mip = 9.787234043e-03 >= tree is empty 0.0% (0; 103)
INTEGER OPTIMAL SOLUTION FOUND
Status: optimal
Objective: 0.009787234
Solution: 26 7 17
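If you need the actual K data points rather than the per-value counts, a small sketch of how to expand them back (assuming you round the solver's values to integers; counts and chosen are just illustrative names):
counts <- round(as.vector(sol$getValue(select)))  # per-value counts, e.g. 26 7 17
chosen <- rep(v, times = counts)                  # expand back into K individual data points
mean(chosen)                                      # should differ from target by the reported objective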
The code below is written in Python, but I think the concept carries over easily and can be reformulated in R if desired.
Basically: reformulate your problem. Instead of optimizing a long vector of binary "selection" variables, all you need are 3 variables, namely the (integer) number of 1's, 2's, and 3's to pick.
This solves almost instantaneously as an IP.
import pyomo.environ as pyo
from random import randint
n = 1000
k = 500
sample = [randint(1, 3) for t in range(n)]
avail = {t : len([val for val in sample if val==t]) for t in range(1, 4)}
target = 86/47
m = pyo.ConcreteModel()
m.vals = pyo.Set(initialize=[1,2,3])
m.pick = pyo.Var(m.vals, domain=pyo.NonNegativeIntegers)
m.delta = pyo.Var()
m.obj = pyo.Objective(expr=m.delta)
# pick exactly k values in total
m.C0 = pyo.Constraint(expr=sum(m.pick[v] for v in m.vals) == k)
# constrain delta to be the absolute value |sum(picks) - target*k|
m.C1 = pyo.Constraint(expr=m.delta >= sum(m.pick[v]*v for v in m.vals) - target*k)
m.C2 = pyo.Constraint(expr=m.delta >= -sum(m.pick[v]*v for v in m.vals) + target*k)
# don't use more than available for each value
def limit(m, v):
    return m.pick[v] <= avail[v]
m.C3 = pyo.Constraint(m.vals, rule=limit)
soln = pyo.SolverFactory('glpk').solve(m)
print(soln)
m.pick.display()
Yields:
Solver:
- Status: ok
Termination condition: optimal
Statistics:
Branch and bound:
Number of bounded subproblems: 885
Number of created subproblems: 885
Error rc: 0
Time: 0.3580749034881592
Solution:
- number of solutions: 0
number of solutions displayed: 0
pick : Size=3, Index=vals
Key : Lower : Value : Upper : Fixed : Stale : Domain
1 : 0 : 3.0 : None : False : False : NonNegativeIntegers
2 : 0 : 0.0 : None : False : False : NonNegativeIntegers
3 : 0 : 304.0 : None : False : False : NonNegativeIntegers
Realize you can also attack this algorithmically quite efficiently and get a (pretty easy) near-optimal answer, or with some sweat equity get the optimal answer as well. Below is a framework I tinkered with. The key observation is that you can keep adding 3's to the solution up until the point where the amount still to go (to reach k * target) can be filled entirely with 1's. That is very close to as-good-as-it-gets, except for cases where you'd be better off substituting a couple of 2's near the end, I think, or backing up if you run out of 1's.
The code below runs (in Python) and is most of the way toward a good approximation.
### Code:
# average hitting
from random import randint
n = 1000
k = 50
sample = [randint(1, 3) for t in range(n)]
available = {t : len([val for val in sample if val==t]) for t in range(1, 4)}
target = 86/47
print(f'available at start: {available}')
sum_target = target * k
soln = []
selections_remaining = k
togo = sum_target - sum(soln)
for pick in range(k):
    if togo > k - pick and available[3] > 0:
        soln.append(3)
        available[3] -= 1
    elif togo > k - pick and available[2] > 0:
        soln.append(2)
        available[2] -= 1
    elif available[1] > 0:
        soln.append(1)
        available[1] -= 1
    else:  # ran out of ones in home stretch... do a swap
        pass
        # some more logic...
    togo = sum_target - sum(soln)
print(f'solution: {soln}')
print(f'generated: {sum(soln)/k} for target of {target}')
print(f'leftover: {available}')
Yields:
available at start: {1: 349, 2: 335, 3: 316}
solution: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
generated: 1.84 for target of 1.8297872340425532
leftover: {1: 291, 2: 335, 3: 274}
[Finished in 117ms]
In portfolio analysis, given a target expected return, we aim to find the weights of the assets that minimize the portfolio variance.
Here is the code:
install.packages("quadprog")
library(quadprog)
# Denoting annualized risk as a vector sigma
sigma <- c(0.56, 7.77, 13.48, 16.64)
# Building the correlation matrix proposed by the question
m <- diag(0.5, nrow = 4, ncol = 4)
m[upper.tri(m)] <- c(-0.07, -0.095, 0.959, -0.095, 0.936, 0.997)
corr <- m + t(m)
sig <- corr * outer(sigma, sigma)
#Defining the mean
mu = matrix(c(1.73, 6.65, 9.11, 10.30), nrow = 4)
m0 = 8
Amat <- t(matrix(c(1, 1, 1, 1,
c(mu),
1, 0, 0, 0,
0, 1, 0, 0,
0, 0, 1, 0,
0, 0, 0, 1), 6, 4, byrow = TRUE))
bvec <- c(1, m0, 0, 0, 0, 0)
qp <- solve.QP(sig, rep(0, nrow(sig)), Amat, bvec, meq = 2)
qp
x = matrix(qp$solution)
x
(t(x) %*% sig %*% x)^0.5
I understand the formulation of mu and the covariance matrix, and I know how to use the quadprog package.
However, I don't understand why Amat and bvec are defined this way, and why Amat is built from a 6-by-4 matrix.
m0 is the expected return we aim to have for the portfolio, and it is fixed at 8%.
Attached is the question
As you are probably aware, the reason that Amat has four columns is that there are four assets that you are allocating over. It has six rows because there are six constraints in your problem:
The allocations add up to 1 (100%)
Expected return = 8%
'Money market' allocation >= 0
'Capital stable' allocation >= 0
'Balance' allocation >= 0
'Growth' allocation >= 0
Look at the numbers that define each constraint. They are why bvec is [1, 8, 0, 0, 0, 0]. Of these six, the first two are equality constraints, which is why meq is set to 2 (the other four are greater than or equal constraints).
Edited to add:
The way the constraints work is this: each column of Amat defines a constraint, which is then multiplied by the asset allocations, with the result equal to (or greater-than-or-equal-to) some target that is set in bvec. For example:
The first column of Amat is [1, 1, 1, 1], and the first entry of bvec is 1. So the first constraint is:
1 * money_market + 1 * capital_stable + 1 * balance + 1 * growth = 1
This is a way of saying that the asset allocations add up to 1.
The second constraint says that the portfolio's expected return equals 8:
1.73 * money_market + 6.65 * capital_stable + 9.11 * balance + 10.30 * growth = 8
Now consider the third constraint, which says that the 'Money market' allocation is greater than or equal to zero. That's because the 3rd column of Amat is [1, 0, 0, 0] and the third entry of bvec is 0. So this constraint looks like:
1 * money_market + 0 * capital_stable + 0 * balance + 0 * growth >= 0
Simplifying, that's the same as:
money_market >= 0
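If it helps to see this concretely, here is a quick check (just a sketch using the objects defined in the question's code): solve.QP uses constraints of the form t(Amat) %*% x >= bvec, with the first meq rows treated as equalities, so you can compare both sides at the optimum.
# left-hand side of each of the six constraints at the optimum, next to its target in bvec
x <- matrix(qp$solution)
cbind(lhs = as.vector(t(Amat) %*% x), rhs = bvec)
# rows 1-2 (the equality constraints, meq = 2) should match exactly;
# rows 3-6 (the nonnegativity constraints) should be >= their rhs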
I am using the R package "pt" to calculate the cumulative prospect theory value.
The first input is the following:
choice_ids <- c(1, 1, 1, 1, 2, 2, 2, 2)
gamble_ids <- c(1, 1, 1, 2, 1, 1, 2, 2)
outcome_ids <- c(1, 2, 3, 1, 1, 2, 1, 2)
objective_consequences <- c(2500, 2400, 0, 2400,2500, 0, 2400, 0)
probability_strings <- c("0.33", "0.66", "0.01", "1.0","0.33", "0.67", "0.34", "0.66")
my_choices <- Choices(choice_ids=choice_ids,gamble_ids=gamble_ids,outcome_ids=outcome_ids,objective_consequences=objective_consequences,probability_strings=probability_strings)
Afterwards
tk_1992_utility <- Utility(fun="power", par=c(alpha=0.88, beta=0.88, lambda=2.25))
linear_in_log_odds_prob_weight <- ProbWeight(fun="linear_in_log_odds", par=c(alpha=0.61, beta=0.724))
comparePT(my_choices,prob_weight_for_positive_outcomes=linear_in_log_odds_prob_weight,prob_weight_for_negative_outcomes=linear_in_log_odds_prob_weight,utility=tk_1992_utility, digits=4)
## cid gid ev pt ce rp
## 1 1 1 2409 881.3 2222 187
## 2 1 2 2400 943.2 2400 -0.000000000001819
## 3 2 1 825 312.6 684.2 140.8
## 4 2 2 816 307.2 670.9 145.1
The comparePT command has the pt value as output, but also quite a lot of other values. However, I would like to have only the pt value as output; is this somehow possible? I looked into the package but unfortunately could not find the formula in there.
It appears pt has been taken off CRAN, but it can be installed from the GitHub archive:
library(devtools)
install_github("cran/pt")
comparePT() is an S4 function. Inspecting these is a little different from the regular S3 kind. First you use showMethods() to see the available methods, then getMethod() for the method you are interested in.
showMethods("comparePT")
# Function: comparePT (package pt)
# object="Choices"
getMethod("comparePT", "Choices")
# Method Definition:
#
# function ...
However, the output from comparePT() is just a regular data.frame, so you can subset it using $ as normal, and wrap as.numeric() around it, since it is coded as character.
as.numeric(comparePT(my_choices, linear_in_log_odds_prob_weight,
linear_in_log_odds_prob_weight, tk_1992_utility, 4)$pt)
# [1] 881.3 943.2 312.6 307.2
I want to find the F value (for a right-tailed distribution) that corresponds to p = 0.05, given two pre-specified degrees of freedom (1 and 29 below). Currently I do this by trial and error:
#..F values..............p values...
1-pf(4.75, 1, 29) # 0.03756451
1-pf(4.15, 1, 29) # 0.05085273
1-pf(4.18295, 1, 29) # 0.05000037
1-pf(4.18297, 1, 29) # 0.04999985
1-pf(4.18296, 1, 29) # 0.05000011
So, I want to obtain F=4.18296 without trial and error. Any idea?
There are two ways to achieve this result; both use the quantile function qf():
qf(1 - 0.05, 1, 29) or qf(0.05, 1, 29, lower.tail = FALSE)
qf(1 - 0.05, 1, 29)
# [1] 4.182964
qf(0.05, 1, 29, lower.tail = FALSE)
# [1] 4.182964
1 - pf(4.182964, 1, 29)
# [1] 0.05000001
The first option takes into account that the default is lower.tail = TRUE, so we have to pass 1 - 0.05.
For the second option, we specify that we want P[X > x] by setting lower.tail = FALSE.
I would like to change the default step-pattern weight of the cost function because I need to standardize my results against those in a paper that does not use the weight of 2 for the diagonal distance. I've read the JSS paper, but I only found other step patterns that are not what I'm really looking for, I guess. For example, imagine we have two time series Q and C:
Q = array(c(0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0),dim=c(8,2))
C = array(c(0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0),dim=c(8,2))
When I calculate the dtw distance, I obtain
alignment = dtw(Q,C,keep=TRUE)
This gives an alignment$distance of 2.41 and a cost matrix where, for example, the [2,2] element is 2 instead of 1 because of the weight (penalization) of 2*d[i,j] on the diagonal when selecting the minimum between:
g[i,j] = min( g[i-1,j-1] + 2 * d[i,j],
              g[i  ,j-1] +     d[i,j],
              g[i-1,j  ] +     d[i,j] )
plot(asymmetricP1)
edit(asymmetricP1)
structure(c(1, 1, 1, 2, 2, 3, 3, 3, 1, 0, 0, 1, 0, 2, 1, 0, 2,
1, 0, 1, 0, 1, 0, 0, -1, 0.5, 0.5, -1, 1, -1, 1, 1), .Dim = c(8L, 4L), class = "stepPattern", npat = 3, norm = "N")
Look at the plot, and consider the branches as ordered from right to left (ie. branch1 = 0.5 weight)
Everything in the script below is in the context of plot(asymmetricP1) and edit(asymmetricP1)
#first 8 digit sequence (1,1,1,2,2,3,3,3....
#branch1: "1,1,1" <- amount of intervals assigned to specificaly branch1; (end, joint, origin)
#branch2: "2,2" <- only 2 intervals, this is the middle diagnol line.
#branch3: "3,3,3" <- amount of interals
#note: Don't be confused by the numbers themselves, ie. "4,4,4" <- 3 intervals; "2,2,2" <- 3 intervals
#for the next sequences consider:
#the sequence of each branch is to be read as farthest from origin -> 0,0
#each interval assignment is accounted for in this order
#next 8 digit sequence: 1, 0, 0, 1, 0, 2, 1, 0,
#branch1: 1,0,0 <- interval position in relation to the query index
#branch2: 1,0 <- interval position in relation to the query index
#branch3: 2,1,0 <- interval position in relation to the query index (again see in plot)
#next 8 digit sequence: 2, 1, 0, 1, 0, 1, 0, 0
#branch1: 2,1,0 <- interval position in relation to the REFERENCE index
#branch2: 1,0 <- interval position in relation to the reference index
#branch3: 1,0,0 <- interval position in relation to the reference index (again see in plot)
#next 8 digit sequence: -1, 0.5, 0.5, -1, 1, -1, 1, 1
#note: "-1" is a signal that indicates weighting values follow
#note: notice that for each -1 that occurs, there is one value less, for example branch 1
# .....which has 3 intervals can only contain 2 weights (0.5 and 0.5)
#branch1: -1,0.5,0.5 <- changing the first 0.5 changes weight of [-1:0] segment (query index)
#branch2: -1,1 <- weight of middle branch
#branch3: -1,1,1 <- changing the second 1 changes weight of [-1,0] segment (query index)
#.Dim=c(8L, 4L):
#8 represents the number of intervals (1,1,1,2,2,3,3,3)
#4 (from what I understand) is the (length of all the branch sequences mentioned previously)/8
#npat = 3
#3 is the number of patterns you described in the structure. ie(1,1,1,2,2,3,3,3)
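If all you actually need is to drop the factor of 2 on the diagonal step (rather than build an asymmetric pattern), here is a minimal sketch, assuming the built-in symmetric1 pattern matches what the paper you are comparing against uses (it weighs all three steps by 1 instead of weighing the diagonal by 2):
# alignment with unit weight on the diagonal step
alignment1 <- dtw(Q, C, step.pattern = symmetric1, keep = TRUE)
alignment1$distance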
Hope this helps, good luck!
I am confused by the return value of the function get.basis(). For example,
lprec <- make.lp(0, 4)
set.objfn(lprec, c(1, 3, 6.24, 0.1))
add.constraint(lprec, c(0, 78.26, 0, 2.9), ">=", 92.3)
add.constraint(lprec, c(0.24, 0, 11.31, 0), "<=", 14.8)
add.constraint(lprec, c(12.68, 0, 0.08, 0.9), ">=", 4)
set.bounds(lprec, lower = c(28.6, 18), columns = c(1, 4))
set.bounds(lprec, upper = 48.98, columns = 4)
RowNames <- c("THISROW", "THATROW", "LASTROW")
ColNames <- c("COLONE", "COLTWO", "COLTHREE", "COLFOUR")
dimnames(lprec) <- list(RowNames, ColNames)
solve(lprec)
Then the basic variables are
> get.basis(lprec)
[1] -7 -2 -3
However, the solution is
> get.variables(lprec)
[1] 28.60000 0.00000 0.00000 31.82759
From the solution, it seems variables 1 and 4 are basic. So where does the vector (-7, -2, -3) come from?
I am guessing it is from 3 constraints and 4 decision variables.
After reviewing the simplex method for bounded variables, I finally understood how this happens. These two links were helpful: Example; Video.
Coming back to this problem: lpSolveAPI (the R interface for lp_solve) rewrites the constraints after adding appropriate slack variables, with the first three columns belonging to the slack variables of the three rows and the four decision variables coming after them. Hence the return of get.basis(), which is -7, -2, -3, refers to columns 7, 2 and 3, i.e. decision variable 4, slack variable 2 and slack variable 3.
For this kind of LP with bounded variables, a variable can be nonbasic at either its lower bound or its upper bound. The return of get.basis(lp, nonbasic=TRUE) lists -1, -4, -5, -6 for the nonbasic variables; the minus sign means they are at their lower bound, i.e. slack variable 1 = 0, column 4 (decision variable 1) = 28.6, column 5 (decision variable 2) = 0 and column 6 (decision variable 3) = 0.
Thus, the optimal solution is 28.6 (nonbasic), 0 (nonbasic), 0 (nonbasic), 31.82 (basic).
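As a small sketch of how to map those indices back to names (assuming, as described above, that the three row/slack variables come before the four decision variables):
# get.basis() returns negative column indices of the enlarged problem:
# columns 1-3 are the row (slack) variables, columns 4-7 the decision variables
idx <- abs(get.basis(lprec))   # 7 2 3
c(RowNames, ColNames)[idx]     # "COLFOUR" "THATROW" "LASTROW"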