Large mixed integer programming in R - possible to solve?

I would like to solve a large mixed integer programming problem, and I have tried the R package lpSolveAPI. The problem is large: 410 binary variables (each can be either 0 or 1) and about 49,422 constraints. I let it run for 1.5 days, but when I try to stop it, R says it needs to be terminated. The same happens if I let it run for a short time, say 15 minutes, and then try to stop it by clicking the red button. Since this happens, I am not sure whether something is wrong with my computer or whether such a problem is simply too large. While running, it uses maybe 20% of the CPU and about 70% of memory. My computer is a 2022 Lenovo X1 Yoga with an i7 2.80 GHz processor and 16 GB of RAM.
The problem itself is constructed in the following way:
library(lpSolveAPI)
# A has length 410. No constraints to begin with.
lprec <- make.lp(0, 410)
set.objfn(lprec, round(A, 0))
lp.control(lprec, sense = "max")
set.type(lprec, 1:length(A), "binary")
# Defining the constraints with a for loop. I will not go into details,
# but it adds 49422 constraints.
for (...) {
  ...
  add.constraint(lprec, MyConstraint, "<=", 1)
  ...
}
lprec
# This prints: Model name: a linear program with 410 decision variables and 49422 constraints
solve(lprec)
The vector "MyConstraint" is different in every iteration, but it has length 410 where 408 elements are 0 and two elements are 1.
That is, I have 410 objects in total, and I want to choose a set of those objects (1 if an object is chosen and 0 otherwise) such that the objective function is maximized. However, some pairs of objects are not allowed, so each of the 49422 constraints specifies which two objects cannot be chosen at once: each constraint says that their sum cannot exceed 1.
So, my question is if there is any way to solve this? If not, how large can such problem be in order to be solvable?
Thank you!
EDIT: ---------------------------------------------
In the comments I was asked to provide an example, so here it is: a similar but much smaller problem. Suppose we have 7 different objects, and these can be allocated into 5 groups. Let us define the groups and the associated savings, denoted by A:
MyGroups <- c(1,0,0,0,1,0,0,
              0,0,1,1,0,0,0,
              0,0,1,0,0,1,0,
              0,0,0,1,0,1,0,
              0,0,1,1,0,1,0)
MyGroups <- matrix(MyGroups, nrow=5, ncol=7, byrow=TRUE)
rownames(MyGroups) <- paste0("Group", 1:5)
colnames(MyGroups) <- paste0("Object", 1:7)
A <- c(50, 30, 100, 100, 200)
That is, group 1 consists of Object 1 and Object 5 (the first row of the matrix MyGroups). Such a group gives a saving of 50. Objective: maximize the total saving by choosing the right groups. Problem: each object can only be part of one group. For example, if group 2 is implemented, then group 3 cannot be implemented, since both groups require object 3. Here we see that the optimal solution is to choose Group 1 and Group 5, which gives a total saving of 50+200=250. I want to be able to find this for a bigger problem. So, first I add constraints that specify which two groups cannot be implemented at the same time.
lprec2 <- make.lp(0, 5)
set.objfn(lprec2, A)
lp.control(lprec2, sense = "max")
set.type(lprec2, 1:5, "binary")
# Defining the constraints
for (i in 1:(5 - 1)) {
  for (j in (i + 1):5) {
    if (max(colSums(MyGroups[c(i, j), ])) > 1) {
      # group i and group j cannot be together. Add constraint
      MyConstraint <- integer(5)
      MyConstraint[c(i, j)] <- 1
      add.constraint(lprec2, MyConstraint, "<=", 1)
    }
  }
}
lprec2
Printing lprec2 shows the resulting mixed integer problem. When I solve it, the solution is:
solve(lprec2)
get.objective(lprec2)
get.variables(lprec2)
Which gives 250 and (1 0 0 0 1) respectively.
In the original problem I have 410 possible groups, implying 410 decision variables. The number of constraints is 49422, but in every row there are exactly two 1s and the remaining entries are 0.
If you could help me to solve such a problem, I would be happy :-). Thanks!

Here is the model formulated using ompr:
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)
library(dplyr)   # for the %>% pipe

MyGroups <- c(1,0,0,0,1,0,0,
              0,0,1,1,0,0,0,
              0,0,1,0,0,1,0,
              0,0,0,1,0,1,0,
              0,0,1,1,0,1,0)
MyGroups <- matrix(MyGroups, nrow=5, ncol=7, byrow=TRUE)
ngroups  <- nrow(MyGroups)
nobjects <- ncol(MyGroups)
coeffs   <- c(50, 30, 100, 100, 200)

model <- MIPModel() %>%
  add_variable(group[i], i=1:ngroups, type = 'binary') %>%
  add_variable(assign[i, j], i=1:ngroups, j=1:nobjects, type = 'binary', MyGroups[i, j] == 1) %>%
  set_objective(sum_over(coeffs[i] * group[i], i=1:ngroups), sense = 'max') %>%
  add_constraint(sum_over(assign[i, j], i=1:ngroups, MyGroups[i, j] == 1) <= 1, j=1:nobjects) %>%
  add_constraint(assign[i, j] == group[i], i=1:ngroups, j=1:nobjects, MyGroups[i, j] == 1) %>%
  add_constraint(sum_over(group[i], i=1:ngroups) <= 2)

result <- solve_model(model, with_ROI("glpk", verbose = TRUE))
<SOLVER MSG> ----
GLPK Simplex Optimizer, v4.47
16 rows, 16 columns, 35 non-zeros
* 0: obj = 0.000000000e+000 infeas = 0.000e+000 (11)
* 12: obj = 2.500000000e+002 infeas = 0.000e+000 (3)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
16 rows, 16 columns, 35 non-zeros
16 integer variables, all of which are binary
Integer optimization begins...
+ 12: mip = not found yet <= +inf (1; 0)
+ 13: >>>>> 2.500000000e+002 <= 2.500000000e+002 0.0% (1; 0)
+ 13: mip = 2.500000000e+002 <= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
<!SOLVER MSG> ----
result
Status: success
Objective value: 250
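To see which groups were selected, ompr's get_solution() accessor can be used (a small usage sketch, not part of the original output):
subset(get_solution(result, group[i]), value > 0) # data frame of the chosen groups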
ompr is a model-management wrapper around the ROI package. It uses an algebraic modeling paradigm like GAMS or AMPL, but has less embedded logic in order to simplify the syntax. With ompr, though, you can test the other solvers that ROI offers as plug-ins: http://roi.r-forge.r-project.org/
Some are free; others, like Mosek, CPLEX and Gurobi, are commercial products. I suggest running a large subset of your problem and checking the relative performance of the different solvers.
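For example, assuming the ROI.plugin.symphony package is installed (a hedged sketch; any other installed ROI plug-in can be named the same way), the same model can be handed to a different solver just by changing the with_ROI() call:
library(ROI.plugin.symphony) # assumed to be installed
result_sym <- solve_model(model, with_ROI("symphony", verbose = TRUE))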
Also note that your toy problem is degenerate; Group (1, 3, 4) is also a solution. I added an additional constraint that can limit the number of groups selected. If your objective function coefficients are integer values, the formulation may have many degenerate solutions. A simple test is to add a small random epsilon to each of the coefficients to eliminate the degeneracy and see whether that improves performance.
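A minimal sketch of that epsilon test (the 1e-4 scale is just an illustrative choice):
set.seed(1)
coeffs_eps <- coeffs + runif(length(coeffs), 0, 1e-4) # tiny jitter breaks ties between solutions
# rebuild the model with coeffs_eps in set_objective() and solve again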

Keying in on a couple of the OP's statements:
In the original problem I have 410 possible groups, implying 410 decision variables. The number of constraints is 49422, but in every row there are exactly two 1s and the remaining entries are 0.
and
I have 32 objects.
It seems like this can be formulated as a one-sided matching problem with a utility matrix instead of a preference matrix, which can be solved using the matchingR package.
The problem is set up using a savings matrix where the row and column indices refer to objects and each cell (and its mirror across the main diagonal) represents the savings for a group composed of two objects (the row and column).
First a smaller example with 7 objects and 15 groups:
library(matchingR) # for the roommate function
library(Rfast) # for the rowSort function
set.seed(379327748)
n <- 7L # number of objects
m <- matrix(0L, n, n) # initialize the savings matrix
# specify the savings for the 15 groups
m[which(lower.tri(m))[sample(n*(n - 1L)/2L, 15)]] <- sample(1e3, 15, TRUE)
# make the savings matrix symmetric
m[upper.tri(m)] <- t(m)[upper.tri(m)]
# the savings matrix: each cell refers to the savings for the row/column pair
# it is symmetric: 1 paired with 2 has the same savings as 2 paired with 1
m
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 0 692 429 767 133 434 619
#> [2,] 692 0 0 997 146 801 0
#> [3,] 429 0 0 214 966 683 0
#> [4,] 767 997 214 0 835 0 0
#> [5,] 133 146 966 835 0 888 513
#> [6,] 434 801 683 0 888 0 0
#> [7,] 619 0 0 0 513 0 0
# use the roommate function to get the optimal pairings
groups <- unique(rowSort(cbind(1:n, roommate(utils = m))))
# remove objects with no pair in the optimal solution (NAs happen only with an
# odd number of objects)
groups <- groups[!is.na(groups[,2]),]
groups <- groups[m[groups] > 0,]
# show the solution
groups
#> [,1] [,2]
#> [1,] 1 7
#> [2,] 2 4
#> [3,] 3 5
c(savings = sum(m[groups]))
#> savings
#> 2582
Now an example with 32 objects and 410 groups. The solution is provided almost instantly.
n <- 32L
m <- matrix(0L, n, n)
m[which(lower.tri(m))[sample(n*(n - 1L)/2L, 410)]] <- sample(1e3, 410, TRUE)
m[upper.tri(m)] <- t(m)[upper.tri(m)]
system.time(groups <- unique(rowSort(cbind(1:n, roommate(utils = m)))))
#> user system elapsed
#> 0 0 0
groups <- groups[m[groups] > 0,]
groups
#> [,1] [,2]
#> [1,] 1 15
#> [2,] 2 18
#> [3,] 3 32
#> [4,] 4 19
#> [5,] 5 30
#> [6,] 6 9
#> [7,] 7 12
#> [8,] 8 14
#> [9,] 10 29
#> [10,] 11 24
#> [11,] 13 16
#> [12,] 17 20
#> [13,] 21 27
#> [14,] 22 31
#> [15,] 23 26
#> [16,] 25 28
c(savings = sum(m[groups]))
#> savings
#> 14369
# check that each object is used only once
max(tabulate(groups, 32L))
#> [1] 1
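If you want to cross-check the matching against the OP's original integer-programming formulation, here is a hedged sketch that rebuilds the pairwise model with lpSolveAPI on the same random instance (object names are illustrative):
library(lpSolveAPI)
pairs <- which(lower.tri(m) & m > 0, arr.ind = TRUE) # one row per candidate group
A <- m[pairs] # saving of each group
lprec <- make.lp(0, nrow(pairs))
set.objfn(lprec, A)
lp.control(lprec, sense = "max")
set.type(lprec, seq_len(nrow(pairs)), "binary")
for (obj in seq_len(nrow(m))) { # each object may appear in at most one chosen group
  idx <- which(pairs[, 1] == obj | pairs[, 2] == obj)
  if (length(idx) > 0) add.constraint(lprec, rep(1, length(idx)), "<=", 1, indices = idx)
}
solve(lprec)
get.objective(lprec) # compare with sum(m[groups]) from the matching above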

Related

Vectorizing a for loop that changes columns of a matrix

Say I have a vector of ages of 100 trees. Then I age those trees up for 5, 10, 15, and 20 years into the future to create a matrix of tree ages for this year and four 5-year planning periods in the future.
But then, I decide to cut some of those trees (only 10 per planning period), documented in a matrix of T/F values where T is harvested and F is not (trees can't be harvested twice).
age.vec <- sample(x = 1:150, size = 100, replace = T) # create our trees
age.mat <- cbind(age.vec, age.vec+5, age.vec + 10, age.vec + 15, age.vec + 20) # grow them up
x.mat <- matrix(data = F, nrow = 100, ncol = 5) # create the empty harvest matrix
x.mat[cbind(sample(1:100, size = 50), rep(1:5, each = 10))] <- T # 10 trees/year harvested
So then, the ages of trees that are harvested become zero in that year:
age.mat[x.mat] <- 0
I then would like to age the harvested trees up again for the following periods. E.g. if a tree were harvested in the first planning period, in the second planning period (5 years later), I want the age of the tree to be 5, then in the third planning period (10 years later), I want the age of the tree to be 10. I have successfully implemented this in the following for loop:
for (i in 2:5){ # we don't need to calculate over the first year
  age.mat[,i] <- age.mat[,i-1] + 5L # add 5 to previous year
  age.mat[x.mat[,i],i] <- 0L # reset age of harvested trees to zero
}
This works; however, it is clunky and slow. Is there a way to implement this faster (i.e. without the for loop)? It is also implemented within a function, which means that using "apply" actually slows things down, so it needs to be vectorized directly. This is something I'm iterating over thousands of times, so speed is of the essence!
Thank you!
An alternative to the t(apply(...)) in @Jon Spring's answer is matrixStats::rowCumsums.
library(matrixStats)
n <- 1e4L
n10 <- n/10L
age.mat <- outer(sample(150, n, TRUE), seq(0, 20, 5), "+")
x.mat <- matrix(FALSE, n, 5) # create the empty harvest matrix
# sample harvests so that no tree is harvested twice
x.mat[matrix(c(sample(n, n/2L), sample(n10:(6L*n10 - 1L)) %/% n10), n/2L)] <- TRUE
f1 <- function(age, x) {
  age[x[,1],] <- 0
  for (i in 2:5){ # we don't need to calculate over the first year
    age[,i] <- age[,i - 1] + 5L # add 5 to previous year
    age[x[,i], i] <- 0L # reset age of harvested trees to zero
  }
  age
}
f2 <- function(age, x) {
  age - rowCumsums(x*age)
}
microbenchmark::microbenchmark(f1 = f1(age.mat, x.mat),
                               f2 = f2(age.mat, x.mat),
                               check = "equal")
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> f1 294.4 530.2 1023.450 566.6 629.35 33222.8 100
#> f2 135.2 263.6 334.622 284.2 307.15 4343.6 100
This looks to be about 12x faster, based on testing with rbenchmark.
Here's an approach relying on the fact that harvesting a tree doesn't stop the passage of time, it just resets the clock. So we can think of a harvest as subtracting the harvest age from all future ages.
x.die <- x.mat * age.mat
x.dif <- t(apply(x.die, 1, cumsum))
age.mat2 <- age.mat - x.dif
In x.die, by multiplying the harvest indicators by the ages, we get the age at each harvest. The next line calculates the cumulative sum of these across each row, and finally we subtract those from the original ages.
I assume your "trees can't be harvested twice" means we won't ever see two TRUEs in one row of x.mat? My code won't work right if there were more than one harvest per tree location.
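A quick way to check that assumption (a small sketch, not from the original answer):
any(rowSums(x.mat) > 1) # should be FALSE if no tree is harvested twice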
I found a way to do it! I implemented the idea of going backwards from @Jon Spring, where I created a matrix with the age of the stand at the harvested year filled in for the harvested year and all subsequent years, then subtracted that from my pre-made aged-up matrix. I built a function similar to what "fill" from tidyr or "na.locf" from zoo do (because they were too slow).
First I used arrayInd to determine the positions in the matrix of trees that were changed. I then used that to make another matrix that combined a repeat of each index row a number of times equal to the number of periods minus the period the tree was harvested in plus one, and a sequence vector of the same length that sequences from the period of the index number to the number of periods.
nper <- ncol(x.mat) # number of planning periods (not defined in the original snippet)
x.ind <- arrayInd(which(x.mat), dim(x.mat)) # gets index of row that was changed
x.new.ind <- cbind(rep(x.ind[,1], times = nper - x.ind[,2] + 1),
                   sequence(nvec = nper - x.ind[,2] + 1, from = x.ind[,2]))
For example, if there was a tree harvested at position [4, 2], meaning the fourth tree was harvested in the second period, and we had 5 periods total, it would create a matrix:
[,1] [,2]
[1,] 4 2
[2,] 4 3
[3,] 4 4
[4,] 4 5
Then I made a vector with the ages of the trees that were harvested in the correct positions, and zeros in the rest of the positions (e.g. for our example, if the tree harvested was 100 years old, we would have a vector of 0 0 0 100 0 (if we had 5 trees)).
ages.vec <- vector(mode = "integer", length = nrow(age.mat))
ages.vec[x.ind[,1]]<- age.mat[x.ind]
I then multiplied this vector by a logical matrix with "T" at the row, column positions in the matrix above.
Continuing with the above example, we get:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 100 100 100 100
[5,] 0 0 0 0 0
I then subtracted it from our current (already aged-up) ages matrix. So tree four was 95 100 105 110 115 and now it is 95 0 5 10 15.
new.ages.mat <- age.mat - replace(x.mat, x.new.ind, TRUE) * ages.vec
Though this might not be the most elegant solution, using microbenchmark, it is 90x faster than our for loop, and 3x faster than the lovely apply function that John created. I would put in the microbenchmark calls and results, but this post is long enough already! I know there's a better way to create the ages.vec and incorporate it, and am going to continue working on that, and will update this answer with my results!
This approach builds on using which with arr.ind=TRUE to create a two-column matrix that encodes the starting locations (first column) and times (second column) for new tree planting. It does violate the functional programming paradigm by using <<- to assign new values to age.mat "in place".
fiveseq <- seq(0, 20, by = 5) # this way one only needs to call `seq` once
apply(which(x.mat, arr.ind = TRUE), 1,
      function(r) { age.mat[r[1], r[2]:5] <<- fiveseq[1:(6 - r[2])] })
In summary, it locates the new locations and intervals and replaces the rest of that row with the right number of items from the sequence {0, 5, 10, 15, 20}
(I would be interested in seeing how this compares with the benchmarking framework that you have already established.)
You can use apply to work on each vector rowwise, then use some logic within the function to adjust the values.
Should be about 4 times faster
age.mat |>
  apply(1, \(x) {
    if (any(x == 0 & (which(x == 0) != length(x)))) {
      x[which(x == 0):length(x)] <- (0:(length(x) - which(x == 0))) * 5
      x
    } else x
  }) |> t()
[,1] [,2] [,3] [,4] [,5]
[1,] 101 0 5 10 15
[2,] 55 60 65 70 75
[3,] 23 28 33 0 5
[4,] 0 5 10 15 20
[5,] 23 28 33 0 5
[6,] 84 0 5 10 15
[7,] 52 57 62 0 5
[8,] 26 31 36 41 0
[9,] 114 119 124 129 0
[10,] 33 38 43 48 53
[11,] 144 149 154 159 164
[12,] 19 24 29 34 39
[13,] 43 48 53 58 63
[14,] 69 74 79 84 89
[15,] 98 103 108 113 118
[16,] 110 115 120 125 130
[17,] 8 13 18 23 28
[18,] 16 21 26 31 36
[19,] 1 6 11 16 21
[20,] 60 65 0 5 10

Shift rows of a large matrix horizontally

I am looking for a way to shift the rows of a square matrix horizontally. Particularly, my question is for a case where the dimension of a matrix is very large, say 500*500 or 1000*1000, but I am giving a small example of 5*5 here to make it clear. Assume we have the following matrix:
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
I would like to shift the rows horizontally in which I get the following matrix and fill the empty cells with zero:
1 7 13 19 25
2 8 14 20 0
3 9 15 0 0
4 10 0 0 0
5 0 0 0 0
Writing code for a small matrix such as this is easy in R, but I am looking at very large matrices, as I pointed out above. Any help would be appreciated.
The example suggests you want to create a new matrix whose jth column is the jth row of the original, shifted to the left j-1 places and padded with zeros on the right, as in this calculation with a 10,000 X 10,000 matrix:
n <- 1e4
a <- matrix(seq_len(n^2), n, byrow=TRUE)
system.time({
b <- matrix(sapply(seq_len(nrow(a)), function(i) c(a[i,i:ncol(a)], rep(0, i-1))), n, n)
})
user system elapsed
0.97 0.00 0.99
(That's using a single thread and reflects a typical run out of many test runs.) One second for a matrix with 100,000,000 entries isn't bad. It's a big RAM hog though, so you might want to modify the code if the input is a sparse matrix so that it outputs a sparse matrix, too.
Reflecting on this, it occurred to me that avoiding the concatenation c and just copying in place should be faster, assuming one could initialize a matrix of zeros extremely quickly. That turns out to be the case (and the code is even simpler):
system.time({
  b <- matrix(0, nrow(a), ncol(a))
  for (i in seq_len(nrow(a))) b[1:(n+1-i), i] <- a[i, i:ncol(a)]
})
user system elapsed
0.62 0.00 0.62
It's about 50% faster. Since the loop overhead will be relatively small and the body of the loop is (presumably) an optimized vector copy, it's unlikely an appreciably faster single-threaded solution exists.
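Regarding the sparse-input remark above, here is a hedged sketch of a sparse variant (it assumes the Matrix package; for a genuinely sparse input it maps each non-zero entry directly to its shifted position instead of materialising a dense result):
library(Matrix)
a_sparse <- as(a, "CsparseMatrix")
idx <- summary(a_sparse) # non-zero entries as a data frame with columns i, j, x
keep <- idx$j >= idx$i # entries that survive the left shift
b_sparse <- sparseMatrix(i = idx$j[keep] - idx$i[keep] + 1L, # new row within the column
                         j = idx$i[keep], # new column = old row
                         x = idx$x[keep],
                         dims = dim(a_sparse))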

Calculate Total Sum of Square Inconsistency

I am attempting to write my own functions for the total sum of squares, within-cluster sum of squares, and between-cluster sum of squares in RStudio, for my own implementation of k-means.
I've successfully written the function for the within sum of squares, but I'm having difficulty with the total sum of squares (and thus the bss). The result I get is significantly larger than what R's own kmeans function computes. I'm confused because I am following exactly what the formulas say. Here is my data:
A =
36 3
73 3
30 3
49 3
47 11
47 11
0 7
46 5
16 3
52 4
0 8
21 3
0 4
57 6
31 5
0 6
40 3
31 5
38 4
0 5
59 4
61 6
48 7
29 2
0 4
19 4
19 3
48 9
48 4
21 5
where each column is a feature. This is the function I've created thus far for tss:
tot_sumoSq <- function(data){
  avg = mean( as.matrix(data) )
  r = matrix(avg, nrow(data), ncol(data))
  tot_sumoSq = sum( (data - r)^2 )
}
I receive the result 24342.4, but R gives 13244.8. Am I completely missing something?
The latter value is calculated using the column means. If you use this for calculating the means, you'll get the same answer.
avg = colMeans(data)
r = matrix(avg, nrow(data), ncol(data), byrow=T)
sum((data - r)^2)
[1] 13244.8
Maybe there is something wrong in your program: you subtract a matrix from a data frame. Use the following instead.
tot_sumoSq <- function(data){
  data = as.matrix(data)
  x = sum((data - mean(data))^2)
  return(x)
}
From my side it gives the correct answer.
I found a solution to my issue by combining the solutions provided by the first two answers. I see what my previous mistake was and would like to clear up any confusion for future scientists.
tot_sumoSq <- function(data){
  avg = colMeans(data)
  r = matrix(avg, nrow(data), ncol(data), byrow = T)
  data = as.matrix(data)
  return( sum( (data - r)^2 ) )
}
Each column is the entire sample for a different feature, so when we calculate the mean of each column, we get the mean of the entire sample for that one feature. My conceptual mistake earlier was to combine both features and calculate a single overall mean.
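As a quick sanity check (a short sketch, assuming the data frame above is called data), the result should now match the totss component reported by kmeans for any choice of k:
fit <- kmeans(data, centers = 3)
all.equal(tot_sumoSq(data), fit$totss) # should be TRUE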

Optimization with Multiple Constraints in R

I have a question about optimizing program use across multiple rooms in a building using a couple of constraints. My question is: how can I find the optimum program for all 4 rooms based on the data and constraints at the same time?
The following is a small sample of my data.
Program sf 101 201 301 401
(1) Offices 120 6 6 5 5
(2) Gallery 1000 5 5 4 4
(3) Reception / Greeter 300 6 6 5 7
(4) Studio / Classroom 800 6 6 5 5
101, 201, 301, and 401 represent 4 rooms in a building.
Program is the potential room use.
sf represents the maximum square feet of the intended program (one constraint).
The data represents how many criteria the intended program matches with attributes of each room (this is the number I want to maximize).
I would also like to place constraints that would allow me to say I only want a certain number of Offices, Galleries, etc. in the building.
e.g., I want 1 Reception/Greeter and 2 Offices, and the last room can be filled by the best available match. Room 101 has a maximum of 150 sf, Room 201 has a maximum of 250 sf, Room 301 has a maximum of 1500 sf, and Room 401 has a maximum of 500 sf (these constraints are not in my data frame because I couldn't think of a good way to include them).
This example should return 101 = Offices, 201 = Offices, 301 = one of the three excluding Gallery, and 401 = Reception / Greeter.
Update:
Objective should be something like this (I want them all maximized):
obj <- data$101, data$201, data$301, data$401 (and probably data$sf too)
Then I'm not really sure how to write the constraints but they would be this:
data$101
data$sf <= 150
number of solutions from this column should be 1
data$201
data$sf <= 250
number of solutions from this column should be 1
...
data$401
data$sf <= 500
number of solutions from this column should be 1
And then finally somehow restrict the number of "Offices", "Gallery", "Reception / Greeter", "Studio / Classroom".
Maybe something like:
as.numeric(data$Program %in% c("(1) Offices")) == 1
Hopefully this clarifies some things.
It sounds like you are assigning programs (rows in data) to rooms (columns in data) such that you maximize the value of the assignment. You can only assign a program to a room if the room is big enough and you can only assign one program to a room, though the same program can be assigned to multiple rooms (e.g. you could assign "Office" to both room 101 and 201). Therefore, your data really consists of program sizes, room sizes, and objective values:
program.size <- c(120, 1000, 300, 800)
room.size <- c(150, 250, 1500, 500)
(obj.vals <- matrix(c(6, 5, 6, 6, 6, 5, 6, 6, 5, 4, 5, 5, 5, 4, 7, 5), nrow=4))
# [,1] [,2] [,3] [,4]
# [1,] 6 6 5 5
# [2,] 5 5 4 4
# [3,] 6 6 5 7
# [4,] 6 6 5 5
A simple way to block assigning a program to a room that is too small would be to set the objective value for such an assignment to a low value (I'll use 0 here):
(obj.adj <- obj.vals * outer(program.size, room.size, "<="))
# [,1] [,2] [,3] [,4]
# [1,] 6 6 5 5
# [2,] 0 0 4 0
# [3,] 0 0 5 7
# [4,] 0 0 5 0
Now, you can approach this problem using integer programming, defining a variable x_pr that takes value 1 if program p is assigned to room r and 0 otherwise. You can code up the objective and constraints pretty easily using the lpSolve package in R:
# Convenience variables
nr <- nrow(obj.adj)
nc <- ncol(obj.adj)
# Model
library(lpSolve)
mod <- lp("max",
as.vector(obj.adj),
t(1*sapply(1:nc, function(x) rep(1:nc == x, each=nr))),
rep("<=", nc),
rep(1, nc),
all.bin=TRUE)
matrix(mod$solution, nrow=nr)
# [,1] [,2] [,3] [,4]
# [1,] 1 1 0 0
# [2,] 0 0 0 0
# [3,] 0 0 0 1
# [4,] 0 0 1 0
Now we've assigned "Office" to rooms 101 and 201, "Studio/Classroom" to room 301, and "Reception/Greeter" to room 401.
It's worth noting that this program could easily be solved by selecting the program with the largest value for each room in obj.adj, so the use of lpSolve is only warranted if you have more complicated constraints than the ones mentioned in the question.
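For example, here is a hedged sketch (not part of the original answer) of adding one of the OP's extra requirements, a cap on how many rooms may receive a given program; it relies on the x_pr variables being stored column-major, room by room:
max.offices <- 2 # illustrative limit on program 1 ("Offices")
office.row <- rep(1 * (1:nr == 1), nc) # picks out x_1r for every room r
mod2 <- lp("max",
           as.vector(obj.adj),
           rbind(t(1*sapply(1:nc, function(x) rep(1:nc == x, each=nr))),
                 office.row),
           c(rep("<=", nc), "<="),
           c(rep(1, nc), max.offices),
           all.bin=TRUE)
matrix(mod2$solution, nrow=nr) # now at most two rooms are assigned "Offices"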

GBM Rule Generation - Coding Advice

I use the R package GBM as probably my first choice for predictive modeling. There are so many great things about this algorithm, but the one "bad" is that I can't easily use the model code to score new data outside of R. I want to write code that can be used in SAS or another system (I will start with SAS, with no access to IML).
Let's say I have the following data set (from the GBM manual) and model code:
library(gbm)
set.seed(1234)
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
#X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
X3[sample(1:N,size=30)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# fit initial model
gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6,        # formula
            data=data,                   # dataset
            var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease, +1: monotone increase, 0: no monotone restriction
            distribution="gaussian",
            n.trees=2,                   # number of trees
            shrinkage=0.005,             # shrinkage or learning rate, 0.001 to 0.1 usually work
            interaction.depth=5,         # 1: additive model, 2: two-way interactions, etc.
            bag.fraction = 1,            # subsampling fraction, 0.5 is probably best
            train.fraction = 1,          # fraction of data for training, first train.fraction*N used for training
            n.minobsinnode = 10,         # minimum total weight needed in each node
            cv.folds = 5,                # do 5-fold cross-validation
            keep.data=TRUE,              # keep a copy of the dataset with the object
            verbose=TRUE)                # print out progress
Now I can see the individual trees using pretty.gbm.tree as in
pretty.gbm.tree(gbm1,i.tree = 1)[1:7]
which yields
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight
0 2 1.5000000000 1 8 15 983.34315 1000
1 1 1.0309565491 2 6 7 190.62220 501
2 2 0.5000000000 3 4 5 75.85130 277
3 -1 -0.0102671518 -1 -1 -1 0.00000 139
4 -1 -0.0050342273 -1 -1 -1 0.00000 138
5 -1 -0.0076601353 -1 -1 -1 0.00000 277
6 -1 -0.0014569934 -1 -1 -1 0.00000 224
7 -1 -0.0048866747 -1 -1 -1 0.00000 501
8 1 0.6015416372 9 10 14 160.97007 469
9 -1 0.0007403551 -1 -1 -1 0.00000 142
10 2 2.5000000000 11 12 13 85.54573 327
11 -1 0.0046278704 -1 -1 -1 0.00000 168
12 -1 0.0097445692 -1 -1 -1 0.00000 159
13 -1 0.0071158065 -1 -1 -1 0.00000 327
14 -1 0.0051854993 -1 -1 -1 0.00000 469
15 -1 0.0005408284 -1 -1 -1 0.00000 30
Page 18 of the gbm manual explains how to interpret this output.
Based on the manual, the first split occurs on the 3rd variable (zero-based in this output), which is gbm1$var.names[3], "X3". That variable is an ordered factor.
types <- lapply(lapply(data[, gbm1$var.names], class),
                function(i) ifelse(strsplit(i[1], " ")[1] == "ordered", "ordered", i))
types[3]
So, the split is at 1.5, meaning that levels 'd' and 'c' (levels[[3]][1:2], also zero-based) go to the left node and the other levels (levels[[3]][3:4]) go to the right.
Next, the rule continues with a split at gbm1$var.names[2] as denoted by SplitVar=1 in the row indexed 1.
Has anyone written anything to move through this data structure (for each tree), constructing rules such as:
"If X3 in ('d','c') and X2<1.0309565491 and X3 in ('d') then scoreTreeOne= -0.0102671518"
which is how I think the first rule from this tree reads.
Or have any advice how to best do this?
The mlmeta package has a function gbm2sas that exports a GBM model from R to SAS.
Here is a very generic answer of how this might be done.
Add some R code to write the output to a file. https://stat.ethz.ch/R-manual/R-devel/library/base/html/sink.html
Then through SAS, access the ability to execute R with: http://support.sas.com/documentation/cdl/en/hostunx/61879/HTML/default/viewer.htm#a000303551.htm
(You'll need to know where your R executable is, so that SAS can point at it when running the R code you have written above.)
From there you should be able to manipulate the output within SAS to do any scoring you may need.
If it is simply a one time scoring and not a process, omit the SAS execution of R and simply develop SAS code to parse through the R output file.
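A minimal sketch of the first step (hedged; the file name is illustrative): dump every fitted tree to a text file that the downstream SAS code can parse.
sink("gbm_trees.txt")
for (i in seq_len(gbm1$n.trees)) {
  cat("## tree", i, "\n")
  print(pretty.gbm.tree(gbm1, i.tree = i))
}
sink()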
