In R, I am trying to optimize the following : choose rows which maximize the number of columns whose sum exceeds a certain value which varies by column + some other basic constraints on the row selections.
Is there anything out there in R which allows you to incorporate logic into an objective function? ie maximize countif ( sum(value column) > target value for column ) over ~10k columns choosing 5 rows with ~ 500 row choices.
Simple example: Grab the combo of 4 rows below whose col sum exceeds the target more frequently than any other combo of 4 rows.
+--------+------+------+------+------+------+------+------+------+------+-------+
| x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
+--------+------+------+------+------+------+------+------+------+------+-------+
| row1 | 82 | 73 | 50 | 11 | 76 | 12 | 46 | 64 | 5 | 44 |
| row2 | 2 | 33 | 35 | 55 | 52 | 18 | 13 | 86 | 72 | 39 |
| row3 | 94 | 5 | 10 | 21 | 90 | 62 | 54 | 54 | 7 | 17 |
| row4 | 27 | 10 | 28 | 87 | 27 | 83 | 62 | 56 | 54 | 86 |
| row5 | 17 | 50 | 34 | 30 | 80 | 7 | 96 | 91 | 32 | 21 |
| row6 | 73 | 75 | 32 | 71 | 37 | 1 | 13 | 76 | 10 | 34 |
| row7 | 98 | 13 | 87 | 49 | 27 | 90 | 28 | 75 | 55 | 21 |
| row8 | 45 | 54 | 25 | 1 | 3 | 75 | 84 | 76 | 9 | 87 |
| row9 | 40 | 87 | 44 | 20 | 97 | 28 | 88 | 14 | 66 | 77 |
| row10 | 18 | 28 | 21 | 35 | 22 | 9 | 37 | 58 | 82 | 97 |
| target | 200 | 100 | 125 | 135| 250 | 89 | 109 | 210| 184 | 178 |
+--------+------+------+------+------+------+------+------+------+------+-------+
EDIT + Update: I implemented the following using ompr, ROI, and some Big M logic.
nr <- 10 # number of rows
nt <- 15 # number of target columns
vals <- matrix(sample.int(nr*nt, nr*nt), nrow=nr, ncol=nt)
targets <- vector(length=nt)
targets[1:nt] <- 4*mean(vals)
model <- MIPModel() %>%
add_variable(x[i], i = 1:nr, type = "binary") %>%
add_constraint(sum_expr(x[i], i = 1:nr)==4)%>%
add_variable(A[j], j = 1:nt, type = "binary") %>%
add_variable(s[j], j = 1:nt, type = "continuous",lb=0) %>%
add_constraint(s[j] <= 9999999*A[j], j =1:nt)%>%
add_constraint(s[j] >= A[j], j =1:nt)%>%
add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j=1:nt) %>%
set_objective(sum_expr(-9999999*A[j], i = 1:nr, j = 1:nt), "max")
model <- solve_model(model,with_ROI(solver = "glpk"))
The model works great for small problems, including those where no solution exists which exceeds the target of every column.
However, the above returns Infeasible when I change the number of columns to even just 150. Given that I tested various scenarios on my smaller example, my hunch is that my model definition is OK...
Any suggestions as to why this is infeasible? Or maybe a more optimal way to define my model?
You could try a Local-Search algorithm. It may give you only a "good" solution; but in exchange it is highly flexible.
Here is a sketch. Start with an arbitrary valid solution x, for instance
for your example data
x <- c(rep(TRUE, 4), rep(FALSE, 6))
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Define an objective function:
obj_fun <- function(x, table, target, ...) {
-sum(colSums(table[x, ]) >= target)
}
Given a table and a target vector, it selects the rows
defined in x and calculates the number of rowsums that
match or exceed the target. I write -sum
because I'll use an implementation that minimises an
objective function.
-obj_fun(x, table, target)
## [1] 7
So, for the chosen initial solution, 7 column sums are equal to or greater than the target.
Then you'll need a neighbourhood function. It takes a
solution x and returns a slightly changed version (a
"neighbour" of the original x). Here is a neighbour function
that changes a single row in x.
nb <- function(x, ...) {
true <- which( x)
false <- which(!x)
i <- true[sample.int(length( true), size = 1)]
j <- false[sample.int(length(false), size = 1)]
x[i] <- FALSE
x[j] <- TRUE
x
}
x
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
nb(x)
## [1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## ^^^^^ ^^^^
Here is your data:
library("orgutils")
tt <- readOrg(text = "
| x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
|--------+------+------+------+------+------+------+------+------+------+-------+
| row1 | 82 | 73 | 50 | 11 | 76 | 12 | 46 | 64 | 5 | 44 |
| row2 | 2 | 33 | 35 | 55 | 52 | 18 | 13 | 86 | 72 | 39 |
| row3 | 94 | 5 | 10 | 21 | 90 | 62 | 54 | 54 | 7 | 17 |
| row4 | 27 | 10 | 28 | 87 | 27 | 83 | 62 | 56 | 54 | 86 |
| row5 | 17 | 50 | 34 | 30 | 80 | 7 | 96 | 91 | 32 | 21 |
| row6 | 73 | 75 | 32 | 71 | 37 | 1 | 13 | 76 | 10 | 34 |
| row7 | 98 | 13 | 87 | 49 | 27 | 90 | 28 | 75 | 55 | 21 |
| row8 | 45 | 54 | 25 | 1 | 3 | 75 | 84 | 76 | 9 | 87 |
| row9 | 40 | 87 | 44 | 20 | 97 | 28 | 88 | 14 | 66 | 77 |
| row10 | 18 | 28 | 21 | 35 | 22 | 9 | 37 | 58 | 82 | 97 |
| target | 200 | 100 | 125 | 135| 250 | 89 | 109 | 210| 184 | 178 |
")
table <- tt[1:10, -1]
target <- tt[11, -1]
Run the search; in this case, with an algorithm called
"Threshold Accepting". I use the implementation in package NMOF (which I maintain).
library("NMOF")
x0 <- c(rep(TRUE, 4), rep(FALSE, 6))
sol <- TAopt(obj_fun,
list(neighbour = nb, ## neighbourhood fun
x0 = sample(x0), ## initial solution
nI = 1000, ## iterations
OF.target = -ncol(target) ## when to stop
),
target = target,
table = as.matrix(table))
rbind(Sums = colSums(table[sol$xbest, ]), Target = target)
## col1 col2 col3 col4 col5 col6 col7 col8 col9 col10
## Sums 222 206 216 135 252 148 175 239 198 181
## Target 200 100 125 135 250 89 109 210 184 178
As I said, this is a only sketch, and depending on how
large and important your actual problem is, there are a number
of points to consider:
most importantly: nI sets the number of search
iterations. 1000 is the default, but you'll definitely
want to play around with this number.
there may be cases (i.e. datasets) for which the
objective function does not provide good guidance: if
selecting different rows does not change the number
of columns for which the target is met, the algorithm
cannot judge if a new solution is better than the
previous one. Thus, adding more-continuous guidance
(e.g. via some distance-to-target) may help.
updating: the computation above actually does a lot
that's not necessary. When a new candidate solution
is evaluated, there would actually be no need to
recompute the full column sums. Instead, only adjust
the previous solution's sums by the changed
rows. (For a small dataset, this won't matter much.)
This isn't quite what you asked as it is cast in python, but perhaps it will show you the approach to doing this with Integer Programming. You should be able to replicate this in R as there are bindings in R for several solvers, including CBC, which is the one I'm using below, which is suitable for Integer Programs.
I'm also using pyomo to frame up the math model for the solver. I think with a little research, you could find equivalent way to do this in R. The syntax at the start is just to ingest the data (which I just pasted into a .csv file). The rest should be readable.
The good/bad...
This solves almost immediately for your toy problem. It can be shown that 5 rows can exceed all column totals.
For many more columns, it can bog down greatly. I did a couple tests with large matrices of random numbers.... This is very challenging for the solver because it cannot identify "good" rows easily. I can get it to solve for 500x100 with random values (and the total row randomized and multiplied by 5 (the number of selections....just to make it challenging) only in reasonable time by relaxing the tolerance on the solution.
If you really have 10K columns, there are only a couple ways this could work... 1. You have several rows that can cover all the column totals (solver should discover this quickly) or 2. there is some pattern (other than random noise) to the data/totals that can guide the solver, and 3. Using a large ratio based gap (or time limit)
import pyomo.environ as pyo
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv", header=None) # this is the data from the post
# uncomment this below for a randomized set of data
# df = pd.DataFrame(
# data = np.random.random(size=(500,100)))
# df.iloc[-1] = df.iloc[-1]*5
# convert to dictionary
data = df.iloc[:len(df)-1].stack().to_dict()
col_sums = df.iloc[len(df)-1].to_dict()
limit = 5 # max number or rows selected
m = pyo.ConcreteModel('row picker')
### SETS
m.R = pyo.Set(initialize=range(len(df)-1))
m.C = pyo.Set(initialize=range(len(df.columns)))
### Params
m.val = pyo.Param(m.R, m.C, initialize=data)
m.tots = pyo.Param(m.C, initialize=col_sums)
### Variables
m.sel = pyo.Var(m.R, domain=pyo.Binary) # indicator for which rows are selected
m.abv = pyo.Var(m.C, domain=pyo.Binary) # indicator for which column is above total
### OBJECTIVE
m.obj = pyo.Objective(expr=sum(m.abv[c] for c in m.C), sense=pyo.maximize)
### CONSTRAINTS
# limit the total number of selections...
m.sel_limit = pyo.Constraint(expr=sum(m.sel[r] for r in m.R) <= limit)
# link the indicator variable to the column sum
def c_sum(m, c):
return sum(m.val[r, c] * m.sel[r] for r in m.R) >= m.tots[c] * m.abv[c]
m.col_sum = pyo.Constraint(m.C, rule=c_sum)
### SOLVE
print("...built... solving...")
solver = pyo.SolverFactory('cbc', options={'ratio': 0.05})
result = solver.solve(m)
print(result)
### Inspect answer ...
print("rows to select: ")
for r in m.R:
if m.sel[r]:
print(r, end=', ')
print("\ncolumn sums from those rows")
tots = [sum(m.val[r,c]*m.sel[r].value for r in m.R) for c in m.C]
print(tots)
print(f'percentage of column totals exceeded: {len([1 for c in m.C if m.abv[c]])/len(m.C)*100:0.2f}%')
Yields:
Problem:
- Name: unknown
Lower bound: -10.0
Upper bound: -10.0
Number of objectives: 1
Number of constraints: 11
Number of variables: 20
Number of binary variables: 20
Number of integer variables: 20
Number of nonzeros: 10
Sense: maximize
Solver:
- Status: ok
User time: -1.0
System time: 0.0
Wallclock time: 0.0
Termination condition: optimal
Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
Statistics:
Branch and bound:
Number of bounded subproblems: 0
Number of created subproblems: 0
Black box:
Number of iterations: 0
Error rc: 0
Time: 0.013128995895385742
Solution:
- number of solutions: 0
number of solutions displayed: 0
rows to select:
0, 2, 3, 8, 9,
column sums from those rows
[261.0, 203.0, 153.0, 174.0, 312.0, 194.0, 287.0, 246.0, 214.0, 321.0]
percentage of column totals exceeded: 100.00%
[Finished in 845ms]
Edit:
I see your edit follows similar pattern to the above solution.
The reason you are getting "INFEASIBLE" for larger instantiations is that your Big-M is no longer big enough when the values are bigger and more are summed. You should pre-analyze your matrix and set BIG_M to be the maximal value in your target row, which will be big enough to cover any gap (by inspection). That will keep you feasible without massive overshoot on BIG_M which has consequences also.
I tweaked a few things on your r model. My r syntax is terrible, but try this out:
model <- MIPModel() %>%
add_variable(x[i], i = 1:nr, type = "binary") %>%
add_constraint(sum_expr(x[i], i = 1:nr)==4)%>%
add_variable(A[j], j = 1:nt, type = "binary") %>%
add_variable(s[j], j = 1:nt, type = "continuous",lb=0) %>%
add_constraint(s[j] <= BIG_M*A[j], j =1:nt)%>%
# NOT NEEDED: add_constraint(s[j] >= A[j], j =1:nt)%>%
# DON'T include A[j]: add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j=1:nt) %>%
add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + s[j] >= targets[j], j=1:nt) %>%
# REMOVE unneded indexing for i: set_objective(sum_expr(A[j], i = 1:nr, j = 1:nt), "min")
# and just minimize. No need to multiply by a large constant here.
set_objective(sum_expr(A[j], j = 1:nt), "min")
model <- solve_model(model,with_ROI(solver = "glpk"))
This is IMHO a linear programming modeling question: Can we formulate the problem as a "normalized" linear problem that can be solved by, for example, ompr or ROI (I would add lpSolveAPI)?
I believe it is possible, though I do not have the time to provide the full formulation. Here are some ideas:
As parameters, i.e. fixed values, we have
nr <- 10 # number of rows
nt <- 10 # number of target columns
vals <- matrix(sample.int(100, nr*nt), nrow=nr, ncol=nt)
targets <- sample.int(300, nt)
The decision variables we are interested in are x[1...nr] as binary variables (1 if the row is picked, 0 otherwise).
Obviously, one constraint would be sum(x[i],i)==4 -- the numbers of rows we pick.
For the objective, I would introduce auxilliary variables, such as
y[j] = 1, if sum_{i=1..nr} x[i]*vals[i,j]>= targets[j]
(and 0 otherwise) for j=1...nt. Now this definition of y is not compatible with linear programming and needs to be linearized. If we can assume that val[i,j] and targets[j] are greater or equal to zero, then we can define y[j] as binary variables like this:
x'vals[,j]-t[j]*y[j] >= 0
(x'y is meant as inner product, i.e. sum(x[i]*y[i], i).)
In the case x'vals[,j]>=t[j], the value y[j]==1 is valid. In the case x'vals[,j]<t[j], y[j]==0 is enforced.
With the objective max sum(y[j],j), we should get a proper formulation of the problem. No big-M required. But additional assumptions on non-negativity introduced.
What you want to solve here is called a "mixed integer program", and there's lots of (mostly commercial) software designed around it.
Your typical R functions such as optim are hardly any good for it due to the kind of constraints, but you can use specialized software (such as CBC) as long as you are able to frame the problem in a standard MIP structure (in this case the variables to optimize are binary variables for each row in your data).
As an alternative, you could also look at the package nloptr with its global derivate-free black-box optimizers, in which you can enter a function like this (setting bounds on the variables) and let it optimize it with some general-purpose heuristics.
I'm trying to create two different subsets from a big data frame. Here's the code :
amelioration <- subset(resilience,
(subset(resilience, resilience$YEAR==2010)$SCORE>
subset(resilience, resilience$YEAR==2009)$SCORE))
The purpose is to select the observations which have a better score in 2010 than in 2009.
The problem is that in the new table (amelioration), the lines for 2009 and 2010 are missing but i want to keep them.
It should look like this :
ID | YEAR | SCORE
--------------------
177 | 2008 | 14
--------------------
177 | 2009 | 11
--------------------
177 | 2010 | 17
--------------------
177 | 2011 | 16
But it looks like this instead :
ID | YEAR | SCORE
--------------------
177 | 2008 | 14
--------------------
177 | 2011 | 16
I tried with the which command but it doesn't work either. What should I do ?
Using data.table you can recast your year variable over score, and filter directly on years.
ambivalence = data.table(ambivalence)
castedDF = dcast(ambivalence, id ~ year, fun.aggregate = sum)
targetRows = castedDF[`2010` > `2009`]
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 7 years ago.
Improve this question
Let a and b be two numbers,
Number of divisors of (a) =n1;
Number of divisors of (b) =n2;
how to find the number of divisors of (a*b) by using n1 and n2?
a=15; b=20
n1=4; //no of divisors of a
n2=6; //no of divisors of b
ans=(a*b); //300
num_ans=18; // no of divisors of a*b
Is there any formula to find it?
You can't determine the number of divisors of the product of two numbers if all you know is the number of divisors of each number.
Example:
2 and 4 have two and three divisors, respectively. Their product, 8, has four divisors.
5 and 9 have two and three divisors, respectively. Their product, 45, has six divisors.
Both of these number pairs have the same number of individual divisors, but their products have different divisors.
No, but you can infer some information about the answer.
The bounds on the number of divisors of ans is [max(n1,n2),n1 * n2] (which is [6,24], for 20 and 21).
It's fairly easy to see how this comes about (at least for smaller numbers), by generating the divisors of 420 from the divisors of 20 and 21.
The divisors of 20 are the column headers, the divisors of 21 are the row headers. The cells contain the result of col_header * row_header for that row and column.
1 2 4 5 10 20
+----+----+----+-----+-----+-----|
1 | 1 | 2 | 4 | 5 | 10 | 20 |
3 | 3 | 6 | 12 | 15 | 30 | 60 |
7 | 7 | 14 | 28 | 35 | 70 | 140 |
21 | 21 | 42 | 84 | 105 | 210 | 420 |
Unfortunately, 20 and 21 are a special case, as they are relatively prime. Other combinations result in duplicate values in the cells.
For example, the table for 15 and 20 looks like this:
1 3 5 15
+----+----+-----+-----|
1 | 1 | 3 | 5 | 15 |
2 | 2 | 6 | 10 | 30 |
4 | 4 | 12 | 20 | 60 |
5 | 5 | 15 | 25 | 75 |
10 | 10 | 30 | 50 | 150 |
20 | 20 | 60 | 100 | 300 |
The numbers 5,10,15,20,30 and 60 appear multiple times, so we can't simply take the number of cells in the table as the number of divisors. The number of unique values in the table does equal the number of divisors (18), which falls within the interval [6,24].
This can give you a ball-park estimate on the complexity of getting the answer, the algorithm for which is essentially generating all of the divisors, discarding duplicates.
I have a table of values that I want to rotate the values of.
For instance, suppose I have the given (sub) table:
id | value | other columns...
---+-------+---
1 | A | ...
29 | B |
37 | C |
56 | D |
I want to rotate the values so that the final state of the table is:
id | value | other columns...
---+-------+---
1 | B | ...
29 | C |
37 | D |
56 | A |
I know all the ids which should take part in the rotate. However, I'm not sure how I would use this knowledge to update the values. Something like this is not valid SQL:
UPDATE table SET value=('B','C','D','A') WHERE id IN (1,29,37,56)
You could use a CASE expression:
UPDATE MyTable
SET value = CASE id
WHEN 1 THEN 'B'
WHEN 29 THEN 'C'
WHEN 37 THEN 'D'
WHEN 56 THEN 'A'
END
WHERE id IN (1, 29, 37, 56);