Linear Programming - Unique number of categories - r

TL;DR: I am trying to find the "cheapest" set of items in a collection that satisfies certain linear constraints. However, every element can be part of multiple "categories", and I also want the selection to contain a mix of those unique categories. I'm not sure whether this can be implemented as an LP and, if so, how to approach it.
Example - Part 1
Let's say I have 7 items that have different costs and different values associated to them.
library(tidyverse)
library(lpSolve)
# Fake data
kd = tibble(
Item = 1:7,
Cost = c(1, 1, 1, 1, 2, 3, 4),
Value =c(1, 1, 3, 4, 6, 3, 2),
Type = c("A", "A", "A", "B", "C", "D", "E")
)
I want to pick 3 of those elements so that Cost is minimized and their Value is >= 5. I can easily do this with lp with the following code:
# Objective function
knapsack.obj = kd$Cost
# Constraints
knapsack.con = matrix(
c(
rep(1, nrow(kd)),
kd$Value
),
nrow = 2, byrow = TRUE
)
knapsack.dir = c("==", ">=")
knapsack.rhs = c(3, 5)
# Solve
knapsackSolution = lp("min", knapsack.obj, knapsack.con, knapsack.dir, knapsack.rhs, all.bin = TRUE)
# Results
kd[knapsackSolution$solution == 1, ]
As expected, this returns Items 1, 2 and 3, which have a combined Value of 5 and minimize the cost.
Example - Part 2
The extra complication I don't know how to solve is adding a constraint that makes sure the picked items come from at least 2 unique categories.
Now the solution I'd expect is Item 1, 2 and 4 (or 1, 3 and 4), which still have a combined cost of 3 and a value of 6 (or 8) that is >= 5, but are not all "A" elements: they also contain Item 4, which is a "B" element.
Any idea on how to implement this in a LP framework?

Mathematical Model
If we introduce a zero-one (data) matrix
Category[i,j] = 1 if item i has type j
0 otherwise
and a binary variable:
y[j] = 1 if an item with type j is selected
0 otherwise
we can develop a simple mathematical model:
The blue colored symbols represent data while the red ones are decision variables.
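The formula itself seems to have been an image that is not reproduced here. Reconstructed from the OMPR code in the Implementation section (so the symbol names are the ones used there, and the color coding is dropped), the model reads roughly:

```latex
\begin{aligned}
\min\quad & \sum_i \mathit{Cost}_i \, x_i \\
\text{s.t.}\quad & \sum_i x_i = \mathit{numItems} \\
& \sum_i \mathit{Value}_i \, x_i \ge \mathit{MinValue} \\
& y_j \le \sum_i \mathit{Category}_{i,j} \, x_i \qquad \forall j \\
& \sum_j y_j \ge \mathit{MinItems} \\
& x_i \in \{0,1\}, \quad y_j \in \{0,1\}
\end{aligned}
```

The third constraint lets y[j] be 1 only if at least one selected item has type j, and the fourth requires at least MinItems such types.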
Note that the variable y[j] can be relaxed to be continuous between 0 and 1.
The advantage of first writing down a mathematical model is that it is easier to reason about than a bunch of R code (at least for me).
Implementation
I use OMPR here for two reasons:
Direct way to implement the model in an equation based fashion. We stay closer to the mathematical model.
Access to better solvers than lpSolve.
Here is the R code:
library(tidyverse)
library(ROI)
library(ROI.plugin.symphony)
library(ompr)
library(ompr.roi)
# Fake data
kd = tibble(
Item = 1:7,
Cost = c(1, 1, 1, 1, 2, 3, 4),
Value =c(1, 1, 3, 4, 6, 3, 2),
Type = c("A", "A", "A", "B", "C", "D", "E")
)
Types <- c("A","B","C","D","E")
Category <- 1*outer(kd$Type,Types,FUN="==")
Type <- 1:length(Types)
numItems <- 3
MinValue <- 5
MinItems <- 2
m <- MIPModel() %>%
add_variable(x[i], i=kd$Item, type="binary") %>%
add_variable(y[j], j=Type, type="binary") %>%
add_constraint(sum_expr(x[i], i=kd$Item) == numItems) %>%
add_constraint(sum_expr(kd$Value[i]*x[i], i=kd$Item) >= MinValue) %>%
add_constraint(y[j] <= sum_expr(Category[i,j]*x[i], i=kd$Item), j=Type) %>%
add_constraint(sum_expr(y[j], j=Type) >= MinItems) %>%
set_objective(sum_expr(kd$Cost[i]*x[i], i=kd$Item),"min") %>%
solve_model(with_ROI(solver = "symphony", verbosity=1))
cat("Status:",solver_status(m),"\n")
cat("Objective:",objective_value(m),"\n")
m$solution
Probably the most complex part here is the calculation of the Category matrix.
Solution
The solution looks like:
Status: optimal
Objective: 3
x[1] x[2] x[3] x[4] x[5] x[6] x[7] y[1] y[2] y[3] y[4] y[5]
1 1 0 1 0 0 0 1 1 0 0 0
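With only 7 items, the solver's answer can be cross-checked by brute force. The following is a minimal base-R sketch (not part of the original answer; the variable names are made up here) that enumerates all 35 three-item subsets and keeps the cheapest ones that meet the value and category requirements:

```r
cost  <- c(1, 1, 1, 1, 2, 3, 4)
value <- c(1, 1, 3, 4, 6, 3, 2)
type  <- c("A", "A", "A", "B", "C", "D", "E")

# One candidate subset per column
subsets <- combn(length(cost), 3)

# Feasible: total value >= 5 and at least 2 distinct categories
feasible <- apply(subsets, 2, function(s)
  sum(value[s]) >= 5 && length(unique(type[s])) >= 2)
total <- apply(subsets, 2, function(s) sum(cost[s]))

best    <- min(total[feasible])
optimal <- subsets[, feasible & total == best, drop = FALSE]

best
t(optimal)   # each row is one optimal selection of items
```

The minimum cost is 3, and there are three tied optima (items 1, 2, 4; items 1, 3, 4; items 2, 3, 4), so a solver may legitimately report any one of them.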

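Actually, we do not have to force the solution to have k-1 or fewer elements from each group. We can instead force each group to have at most g_i-1 elements, where g_i is the number of elements in group i. Note that this only guarantees a second group when no group has more than k elements: a group with g_i > k could still supply all k picks, and in that case the k-1 bound is the one that works.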
Here is the implementation:
library(tibble)   # for tibble()
library(purrr)    # for map() and %>%
library(lpSolve)
library(fastmatch)
# Fake data
kd = tibble(
Item = 1:7,
Cost = c(1, 1, 1, 1, 2, 3, 4),
Value =c(1, 1, 3, 4, 6, 3, 2),
Type = c("A", "A", "A", "B", "C", "D", "E")
)
# number of elements to choose
k = 3
type_match <- fmatch(kd$Type, unique(kd$Type))
unique_cat <- unique(type_match)
add_con <- map(unique_cat,function(x) {
type_match[type_match != x] = 0
type_match[type_match > 0] = 1
return(type_match)}) %>%
do.call(rbind,.)
knapsack.obj = kd$Cost
knapsack.con =
rbind(
rep(1, nrow(kd)),
kd$Value,
add_con
)
rhs_add <- apply(add_con, 1, function(x) ifelse(sum(x)>1,sum(x) - 1,1))
knapsack.dir = c("==", ">=", rep("<=",length(rhs_add)))
knapsack.rhs = c(k, 5, rhs_add)
knapsackSolution = lp("min",
knapsack.obj,
knapsack.con,
knapsack.dir,
knapsack.rhs,
all.bin = TRUE)
knapsackSolution$solution
> knapsackSolution$solution
[1] 1 1 0 1 0 0 0

Since we know that the solution must have k = 3 elements, restrict each group to have k-1 or fewer elements, which forces at least 2 groups to be used.
incid <- +outer(unique(kd$Type), kd$Type, "==")
ntypes <- nrow(incid)
knapsack.con = rbind(
rep(1, nrow(kd)),
kd$Value,
incid)
k <- 3
knapsack.dir = c("==", ">=", rep("<=", ntypes))
knapsack.rhs = c(k, 5, rep(k-1, ntypes))
res <- lp("min", knapsack.obj, knapsack.con, knapsack.dir, knapsack.rhs, all.bin = TRUE)
res$status
## [1] 0
res$solution
## [1] 1 1 0 1 0 0 0
Simplification
As we discussed in the comments, for this particular data we can omit the last 4 constraints since they are always satisfied as there is only one element in each of the last 4 groups.
res2 <- lp("min", knapsack.obj, knapsack.con[1:3, ], knapsack.dir[1:3],
knapsack.rhs[1:3], all.bin = TRUE)
res2$status
## [1] 0
res2$solution
## [1] 1 1 0 1 0 0 0
Generalizing
As we discussed in the comments, to generalize let us suppose we want at least 3 different categories in the solution rather than 2. In this particular data we could simply require the solution have no more than 1 of each category but in general that will not work so let us take all combinations of groups 2 at a time and produce the constraints shown below. 5 is the total number of categories in the problem and 2 is one less than the number of categories required to be in the solution.
combos <- combn(5, 2, function(x) colSums(incid[x, ]))
For each of those constraints, i.e. each column of combos (combn returns one column per combination, so use t(combos) to get one constraint per row), we require that it be less than or equal to 2 in order to exclude any solution having only 1 or 2 categories. We then construct the LP in a similar manner as before, adding the remaining constraints.
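The construction described above can be sketched end to end as follows. This is an illustration under assumptions rather than the answerer's own code: min_cats is a name introduced here, the data are retyped as plain vectors, and the pair constraints are passed to lp as rows via t(combos).

```r
library(lpSolve)

cost  <- c(1, 1, 1, 1, 2, 3, 4)
value <- c(1, 1, 3, 4, 6, 3, 2)
type  <- c("A", "A", "A", "B", "C", "D", "E")

k        <- 3   # number of items to pick
min_cats <- 3   # hypothetical name: categories required in the solution

# incid[g, i] is 1 when item i belongs to group g
incid <- +outer(unique(type), type, "==")

# One column per set of (min_cats - 1) groups: membership of each item in that set
combos <- combn(nrow(incid), min_cats - 1,
                function(g) colSums(incid[g, , drop = FALSE]))

con <- rbind(rep(1, length(cost)), value, t(combos))
dir <- c("==", ">=", rep("<=", ncol(combos)))
rhs <- c(k, 5, rep(k - 1, ncol(combos)))

res <- lp("min", cost, con, dir, rhs, all.bin = TRUE)
res$status
res$objval
type[res$solution > 0.5]   # categories of the selected items
```

Any set of min_cats - 1 groups may hold at most k - 1 of the picks, so no selection can fit inside fewer than min_cats groups. For this data the optimum has cost 4 (one "A" item, item 4 and item 5); which "A" item is returned is up to the solver. The number of constraints grows as choose(number of groups, min_cats - 1), so the y[j] formulation scales better when there are many groups.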

Related

R Sort or order with custom compare function

Can I pass a custom compare function to order that, given two items, indicates which one is ranked higher?
In my specific case I have the following list.
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
If we take two vectors a and b, the index of the first element i at which a[i] > b[i] or a[i] < b[i] should determine what vector comes first. In this example, scores[['d']] > scores[['a']] because scores[['d']][2] > scores[['a']][2] (note that it doesn't matter that scores[['d']][5] < scores[['a']][5]).
Comparing two of those vectors could look something like this.
compare <- function(a, b) {
# get first element index at which vectors differ
i <- which.max(a != b)
if(a[i] > b[i])
1
else if(a[i] < b[i])
-1
else
0
}
The sorted keys of scores by using this comparison function should then be d, b, a, c.
From other solutions I've found, they mess with the data before ordering or introduce S3 classes and apply comparison attributes. With the former I fail to see how to mess with my data (maybe turn it into strings? But then what about numbers above 9?), with the latter I feel uncomfortable introducing a new class into my R package only for comparing vectors. And there doesn't seem to be a sort of comparator parameter I'd want to pass to order.
Here's an attempt. I've explained every step in the comments.
compare <- function(a, b) {
# subtract vector b from vector a
comparison <- a - b
# get the first non-zero result
result <- comparison[comparison != 0][1]
# return 1 if a ranks higher (first difference > 0), 2 if b ranks higher (0 if equal)
if(is.na(result)) {return(0)} else if(result > 0) {return(1)} else {return(2)}
}
compare_list <- function(list_) {
# get combinations of all possible comparison
comparisons <- combn(length(list_), 2)
# compare all possibilities
results <- apply(comparisons, 2, function(x) {
# get the "winner"
x[compare(list_[[x[1]]], list_[[x[2]]])]
})
# get frequency table (how often a vector "won" -> this is the result you want)
fr_tab <- table(results)
# vector that is last in comparison
last_vector <- which(!(1:length(list_) %in% as.numeric(names(fr_tab))))
# return the sorted results and add the last vectors name
c(as.numeric(names(sort(fr_tab, decreasing = T))), last_vector)
}
If you run the function on your example, the result is
> compare_list(scores)
[1] 4 2 1 3
I haven't dealt with the case where two vectors are identical, since you haven't explained how that should be handled.
The native R way to do this is to introduce an S3 class.
There are two things you can do with the class. You can define a method for xtfrm that converts your list entries to numbers. That could be vectorized, and conceivably could be really fast.
But you were asking for a user defined compare function. This is going to be slow because R function calls are slow, and it's a little clumsy because nobody does it. But following the instructions in the xtfrm help page, here's how to do it:
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
# Add a class to the list
scores <- structure(scores, class = "lexico")
# Need to keep the class when subsetting
`[.lexico` <- function(x, i, ...) structure(unclass(x)[i], class = "lexico")
# Careful here: identical() might be too strict
`==.lexico` <- function(a, b) {identical(a, b)}
`>.lexico` <- function(a, b) {
a <- a[[1]]
b <- b[[1]]
i <- which(a != b)
length(i) > 0 && a[i[1]] > b[i[1]]
}
is.na.lexico <- function(a) FALSE
sort(scores)
#> $c
#> [1] 1 1 2 2 3 4
#>
#> $a
#> [1] 1 1 2 3 4 4
#>
#> $b
#> [1] 1 2 2 2 3 4
#>
#> $d
#> [1] 1 2 3 3 3 4
#>
#> attr(,"class")
#> [1] "lexico"
Created on 2021-11-27 by the reprex package (v2.0.1)
This is the opposite of the order you asked for, because by default sort() sorts to increasing order. If you really want d, b, a, c use sort(scores, decreasing = TRUE).
Here's another, very simple solution:
sort(sapply(scores, function(x) as.numeric(paste(x, collapse = ""))), decreasing = T)
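What it does is take each vector, "compress" it into a single number by pasting its elements together, and then sort those numbers in decreasing order. Note that this only works when every element is a single digit and all vectors have the same length; with values above 9 the pasted numbers no longer compare element by element.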
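For equal-length numeric vectors there is also a way to get the same ordering without a new class or string tricks. The sketch below is an addition, not taken from the answers above: it passes each position as a separate sort key to order(), which breaks ties on the first key with the second, and so on. It assumes all vectors have the same length.

```r
scores <- list(
  'a' = c(1, 1, 2, 3, 4, 4),
  'b' = c(1, 2, 2, 2, 3, 4),
  'c' = c(1, 1, 2, 2, 3, 4),
  'd' = c(1, 2, 3, 3, 3, 4)
)

# One row per vector, one column per position
m <- do.call(rbind, scores)

# Each column becomes a sort key: first position, then second, ...
keys <- lapply(seq_len(ncol(m)), function(j) m[, j])
ord  <- do.call(order, c(keys, list(decreasing = TRUE)))

names(scores)[ord]
# [1] "d" "b" "a" "c"
```

Because the keys are compared as numbers, values above 9 are handled correctly.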

What solves my problem: Map, reduce or a recursion?

I really need some help to write a recursion in R.
The function that I want changes a certain observation according to a set of comparisons between different rows in a data frame, which I shall call g. One of these comparisons depends on the previous value of this same observation.
Suppose first that I want to update the value of column index, row i in my data df in the following way:
j <- 1:4
g <- (df$dom[i] > 0 &
abs(df$V2009[i] - df$V2009[j]) <= w) |
df$index[i] == df$index[j]
df$index[i] <- ifelse(any(g), which(g)[[1]], df$index[[i]])
The thing is, the object w is actually a list:
w = list(0, 1, 2, df$age[i])
So, as you can see, I want to create a function foo() that updates df$index iteratively. It changes it by looping through w and comparisons depend on updated values.
Here is some data:
df <- data.frame(dom = c(0, 0, 6, 6),
V2009 = c(9, 11, 9, 11),
index = c(1, 2, 1, 2),
age = c(2, 2, 2, 2))
I am not sure if a recursive function is actually needed or if something like reduce or map would do it.
Thank you!
The following function uses a double for loop to change the values of column index according to the condition defining g. It accepts a data.frame as input and returns the updated data.frame.
foo <- function(x){
change_index <- function(x, i, w){
j <- seq_len(nrow(x))
(x$dom[i] > 0 & abs(x$V2009[i] - x$V2009[j]) <= w) |
x$index[i] == x$index[j]
}
for(i in seq_len(nrow(x))){
W <- list(0, 1, 2, x$age[i])
for(w in W){
g <- change_index(x, i, w)
if(any(g)) x$index[i] <- which(g)[1]
}
}
x
}
foo(df)
# dom V2009 index age
#1 0 9 1 2
#2 0 11 2 2
#3 6 9 1 2
#4 6 11 1 2
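Since the question asks about reduce, the same row-by-row update can also be phrased with base R's Reduce. This is only a sketch: update_row and foo_reduce are names made up here, and it is meant to mirror the double loop above rather than to be faster.

```r
df <- data.frame(dom   = c(0, 0, 6, 6),
                 V2009 = c(9, 11, 9, 11),
                 index = c(1, 2, 1, 2),
                 age   = c(2, 2, 2, 2))

# Update row i: fold over the thresholds, carrying the current index value
update_row <- function(x, i) {
  W <- list(0, 1, 2, x$age[i])
  x$index[i] <- Reduce(function(idx, w) {
    current <- replace(x$index, i, idx)   # index column as it stands at this step
    g <- (x$dom[i] > 0 & abs(x$V2009[i] - x$V2009) <= w) | idx == current
    if (any(g)) which(g)[1] else idx
  }, W, x$index[i])
  x
}

# Fold over the rows, carrying the partially updated data frame
foo_reduce <- function(x) Reduce(update_row, seq_len(nrow(x)), x)

foo_reduce(df)$index
# [1] 1 2 1 1
```

The inner Reduce plays the role of the loop over w and the outer one the loop over i; the accumulator is what makes each comparison see the previously updated values.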
One can define w inside a function and use lexical scoping (closure).
Using your instructions, the function index_value calculates for any given i the index value.
correct_index_col returns the corrected df.
df <- data.frame(dom = c(0, 0, 6, 6),
V2009 = c(9, 11, 9, 11),
index = c(1, 2, 1, 2),
age = c(2, 2, 2, 2))
index_value <- function(df, i) {
j <- nrow(df)
w <- c(0, 1, 2, df$age[i])
g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w) |
df$index[i] == df$index[j]
ifelse(any(g), which(g)[[1]], df$index[[i]])
}
correct_index_col <- function(df) {
indexes <- Vectorize(function(i) {
index_value(df, i)
})
df$index <- indexes(1:nrow(df))
df
}
# > correct_index_col(df)
# dom V2009 index age
# 1 0 9 1 2
# 2 0 11 1 2
# 3 6 9 3 2
# 4 6 11 1 2
#
If you want to really update (mutate) your df, then you have to do
df <- correct_index_col(df).
Here is an attempt of my own. I guess I figured out a way to use recursion over mutate:
test <- function(i, df, k){
j <- 1:nrow(df)
w <- list(0, 1, 2, df$age[i])
g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w[[k]]) |
df$index[i] == df$index[j]
l <- ifelse(any(g), which(g)[1], df$index[i])
return(l)
}
loop <- function(data,
k = 1) {
data <- data %>%
mutate(index = map_dbl(seq(n()),
~ test(.x, df = cur_data(), k)))
if (k == 4) {
return(data)
} else {
return(loop(data, k + 1))
}
}
df %>% loop()
I welcome any comments in case this is inefficient for large datasets.

Multivariate cumulative sum

Assume one wished to calculate a cumulative sum based on a multivariate condition, all(Z[i] <= x), for all i over a multivariate grid x. One may obviously implement this naively
cSums <- numeric(nrow(x))
for(i in seq(nrow(x))){
for(j in seq(nrow(Z))){
if(all(Z[j, ] <= x[i, ]))
cSums[i] <- cSums[i] + R[j] # <== R is a single vector to be summed
}
}
which needs on the order of n^2 * p comparisons (n rows in both x and Z, p columns), or slightly faster by iteratively subsetting the columns
cSums <- numeric(nrow(x))
for(i in seq(nrow(x))){
indx <- seq(nrow(Z))
for(j in seq(ncol(Z))){
indx <- indx[which(Z[indx, j] <= x[i, j])]
}
cSums[i] <- sum(R[indx])
}
but this is still, in the worst case, as slow as the naive implementation. How could one improve this to achieve faster performance, while still allowing an undefined number of columns to be compared?
Dummy data and Reproducible example
var1 <- c(3,3,3,5,5,5,4,4,4,6)
var2 <- rep(seq(1,5), each = 2)
Z <- cbind(var1, var2)
x <- Z
R <- rep(1, nrow(x))
# Result using either method.
#[1] 2 2 3 4 6 6 5 5 6 10
outer is your friend, just Vectorize your comparison. colSums yields the desired result then. Should be fast.
f <- Vectorize(function(k, l) all(Z[k, ] <= x[l, ]))
res <- colSums(outer(1:nrow(Z), 1:nrow(x), f))
res
# [1] 2 2 3 4 6 6 5 5 6 10
Data
x <- Z <- structure(c(3, 3, 3, 5, 5, 5, 4, 4, 4, 6, 1, 1, 2, 2, 3, 3, 4,
4, 5, 5), .Dim = c(10L, 2L), .Dimnames = list(NULL, c("var1",
"var2")))
We can use apply row-wise to compare every row with every other row and count how many of them satisfy the criteria.
apply(Z, 1, function(x) sum(rowSums(Z <= as.list(x)) == length(x)))
#[1] 2 2 3 4 6 6 5 5 6 10
A similar approach can also be performed using sapply + split
sapply(split(Z, seq_len(nrow(Z))), function(x)
sum(rowSums(Z <= as.list(x)) == length(x)))
data
var1 <- c(3,3,3,5,5,5,4,4,4,6)
var2 <- rep(seq(1,5), each = 2)
Z <- data.frame(var1, var2)
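One thing to keep in mind with the answers above: they count the matching rows, which equals the requested sum only because R is all ones in the example. A hedged sketch that keeps the weights, using one outer comparison per column (the name le is made up here, and Z is assumed to be a matrix as in the question):

```r
var1 <- c(3, 3, 3, 5, 5, 5, 4, 4, 4, 6)
var2 <- rep(seq(1, 5), each = 2)
Z <- cbind(var1, var2)
x <- Z
R <- rep(1, nrow(Z))

# One nrow(Z) x nrow(x) logical matrix per column, combined with `&`:
# le[k, l] is TRUE when Z[k, ] <= x[l, ] holds in every column
le <- Reduce(`&`, lapply(seq_len(ncol(Z)),
                         function(j) outer(Z[, j], x[, j], "<=")))

# Weight row k of Z by R[k], then sum per grid point
cSums <- colSums(le * R)
cSums
# [1]  2  2  3  4  6  6  5  5  6 10
```

This still does n^2 * p comparisons, but in vectorized calls rather than R-level loops, and it works for any number of columns. The trade-off is memory: it materializes a full nrow(Z) x nrow(x) matrix.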

Replacing values in vector by 0 except for sample and looping this

I have a vector, in this case "dist_SLA" for which I want to do the following:
I want to take samples of increasing sizes, from size = 1 until all values of "dist_SLA" are sampled (so size = 1, size = 2, size = 3, ..., size = length of "dist_SLA"). I'll call the sample vectors sample.i
Then I want to transform all the sample vectors "sample.i" to new ones using this method: The vector should be transformed so that all values from "dist_SLA" that were not sampled in sample.i are replaced by 0, so that it gives me a vector which includes the sampled values and zeros. I'll call the new vectors "sp.i"
Lastly, I want to make a list which combines all calculated R-squares of lm of all different transformed vectors "sp.i" and "dist_SLA" (So R-square of sp.1 with "dist_SLA" + R-square of sp.2 with "dist_SLA", etc)
I have tried the following:
dist_SLA <- c(1, 4, 9, 3, 4, 6)
for (i in 1:NROW(dist_SLA)){
sample_[i] <- sample(dist_SLA, size = i )
sp_[i] <- ifelse(dist_SLA == sample_[i], yes = sample_[i], no = "0")
lm_[i] <- lm(dist_SLA ~ sp_[i])
fit_[i] <- summary(lm_[i])$r.squared
}
But this gives me a few problems:
The "ifelse" function gives me a vector in which all values that are identical to the value(s) of the sample won't get replaced by 0 in "sp_1". I therefore want a vector in which only the sample value(s) is/are not replaced by 0 but the others are.
The loop does not work in this way but I cannot figure out how.
How can I fix this?
I believe the following does what you want.
Note that you don't need to keep the sample.i vectors, since only the R-squared values will be saved; all you need is a vector to store them in.
set.seed(3520) # Make the results reproducible
dist_SLA <- c(1, 4, 9, 3, 4, 6)
n <- length(dist_SLA)
fit <- numeric(length(dist_SLA))
for (i in seq_along(dist_SLA)){
smpl <- sample(n, size = i)
sp <- numeric(length(dist_SLA))
sp[smpl] <- dist_SLA[smpl]
lmi <- lm(dist_SLA ~ sp)
fit[i] <- summary(lmi)$r.squared
}
fit
#[1] 0.6480000 0.0200000 0.1739130 0.7667327 0.8711111 1.0000000
Try this:
dist_SLA <- c(1, 4, 9, 3, 4, 6)
set.seed(123)
sample_ <- sample(dist_SLA, size = 3)
sample_
[1] 4 3 6
Then this will give you
dist_SLA==sample_
[1] FALSE FALSE FALSE FALSE FALSE TRUE
Whereas using %in% gives:
dist_SLA %in% sample_
[1] FALSE TRUE FALSE TRUE TRUE TRUE
And
ifelse(dist_SLA %in% sample_, dist_SLA, 0)
[1] 0 4 0 3 4 6
So your loop, depending on what you want to save for later usage could look like
set.seed(123)
dist_SLA <- c(1, 4, 9, 3, 4, 6)
lm_ <- vector(mode = "list", length = length(dist_SLA))
fit_ <- vector(mode = "numeric", length = length(dist_SLA))
for(x in 1 : length(dist_SLA)){
sample_ <- sample(dist_SLA, size = x)
spi <- ifelse(dist_SLA %in% sample_, dist_SLA, 0)
lm_[[x]] <- lm(dist_SLA ~ spi)
fit_[x] <- summary(lm_[[x]])$r.squared
}
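One caveat with matching by value: dist_SLA contains the value 4 twice, so %in% keeps both copies even when only one of them was sampled, which is the behavior the question wanted to avoid. Sampling positions instead of values sidesteps this. A small deterministic sketch (the positions in idx are fixed by hand here purely for illustration, in place of a call to sample):

```r
dist_SLA <- c(1, 4, 9, 3, 4, 6)

# Pretend positions 2 and 4 were sampled (values 4 and 3)
idx <- c(2, 4)

# Matching by value: the unsampled 4 at position 5 is kept as well
by_value <- ifelse(dist_SLA %in% dist_SLA[idx], dist_SLA, 0)
by_value
# [1] 0 4 0 3 4 0

# Matching by position: only the sampled entries survive
by_index <- replace(numeric(length(dist_SLA)), idx, dist_SLA[idx])
by_index
# [1] 0 4 0 3 0 0
```

In the loop, that would mean drawing sample(length(dist_SLA), size = x) and filling by position, as the first answer does.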

Variable sample upper value in R

I have the following matrix
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow=5, ncol=2)
colnames(m) = c("Y","Z")
m <-data.frame(m)
I am trying to create a random number in each row where the upper limit is a number based on a variable value (in this case 1*Y based on each row's value for Z)
I currently have:
samp<-function(x){
sample(0:1,1,replace = TRUE)}
m$randoms <- apply(m,1,samp)
which works well, applying the sample function independently to each row, but I always get an error when I try to alter the x in sample. I thought I could do something like this:
samp<-function(x){
sample(0:m$Z,1,replace = TRUE)}
m$randoms <- apply(m,1,samp)
but I guess that was wishful thinking.
Ultimately I want the result:
Y Z randoms
2 5 4
4 7 7
3 9 3
5 3 1
1 7 6
Any ideas?
The following will sample from 0 to m$Y for each row, and store the result in randoms:
m$randoms <- sapply(m$Y + 1, sample, 1) - 1
Explanation:
The sapply takes each value in m$Y separately (let's call this y), and calls sample(y + 1, 1) on it.
Note that (e.g.) sample(y+1, 1) will sample 1 random integer from the range 1:(y+1). Since you want a number from 0 to y rather than 1 to y + 1, we subtract 1 at the end.
Also, just pointing out - no need for replace=T here because you are only sampling one value anyway, so it doesn't matter whether it gets replaced or not.
Based on @mathematical.coffee's suggestion and my edited example, this is the slick final result:
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow=5, ncol=2)
colnames(m) = c("Y","Z")
m <-data.frame(m)
# sample one integer from 0 to Z for each row
m$randoms <- sapply(m$Z + 1, sample, 1) - 1
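If a fully vectorized version is preferred, one option (an alternative sketched here, not something proposed in the thread) relies on runif accepting a vector of upper bounds: drawing from [0, Z + 1) and taking the floor gives an integer from 0 to Z for each row.

```r
m <- data.frame(Y = c(2, 4, 3, 5, 1), Z = c(5, 7, 9, 3, 7))

set.seed(42)  # arbitrary seed, only so the run is repeatable

# runif() recycles max over the rows; floor() maps [0, Z + 1) onto 0..Z
m$randoms <- floor(runif(nrow(m), min = 0, max = m$Z + 1))

# Every draw respects its own row's upper limit
stopifnot(all(m$randoms >= 0 & m$randoms <= m$Z))
m
```

The draws will differ from those of the sapply version for the same seed, since the two consume the random number stream differently.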
