Optimization with Multiple Constraints in R

I have a question about optimizing program use across multiple rooms in a building using a couple of constraints. My question is: how can I find the optimum program for all 4 classrooms based on the data and the constraints at the same time?
The following is a small sample of my data.
Program                  sf    101  201  301  401
(1) Offices               120    6    6    5    5
(2) Gallery              1000    5    5    4    4
(3) Reception / Greeter   300    6    6    5    7
(4) Studio / Classroom    800    6    6    5    5
101, 201, 301, and 401 represent 4 rooms in a building.
Program is the potential room use.
sf represents the maximum square feet of the intended program (one constraint).
The data represents how many criteria the intended program matches with attributes of each room (this is the number I want to maximize).
I would also like to place constraints that would allow me to say I only want a certain number of Offices, Galleries, etc. in the building.
e.g., I want 1 Reception/Greeter and 2 Offices, and the last room can be filled by the best available match. Room 101 has a maximum of 150 sf, Room 201 has a maximum of 250 sf, Room 301 has a maximum of 1500 sf, and Room 401 has a maximum of 500 sf (these constraints are not in my data frame because I couldn't think of a good way to include them).
This example should return 101 = Offices, 201 = Offices, 301 = one of the three excluding Gallery, and 401 = Reception / Greeter.
Update:
Objective should be something like this (I want them all maximized):
obj <- data$`101`, data$`201`, data$`301`, data$`401` (and probably data$sf too)
Then I'm not really sure how to write the constraints, but they would be something like this:
data$`101`: data$sf <= 150, and the number of solutions from this column should be 1
data$`201`: data$sf <= 250, and the number of solutions from this column should be 1
...
data$`401`: data$sf <= 500, and the number of solutions from this column should be 1
And then finally somehow restrict the number of "Offices", "Gallery", "Reception / Greeter", "Studio / Classroom".
Maybe something like:
as.numeric(data$Program %in% c("(1) Offices")) == 1
Hopefully this clarifies some things.

It sounds like you are assigning programs (rows in data) to rooms (columns in data) so as to maximize the total value of the assignment. You can only assign a program to a room if the room is big enough, and you can only assign one program to a room, though the same program can be assigned to multiple rooms (e.g., you could assign "Offices" to both room 101 and room 201). Therefore, your data really consists of program sizes, room sizes, and objective values:
program.size <- c(120, 1000, 300, 800)
room.size <- c(150, 250, 1500, 500)
(obj.vals <- matrix(c(6, 5, 6, 6, 6, 5, 6, 6, 5, 4, 5, 5, 5, 4, 7, 5), nrow=4))
# [,1] [,2] [,3] [,4]
# [1,] 6 6 5 5
# [2,] 5 5 4 4
# [3,] 6 6 5 7
# [4,] 6 6 5 5
A simple way to block assigning a program to a room that is too small would be to set the objective value for such an assignment to a low value (I'll use 0 here):
(obj.adj <- obj.vals * outer(program.size, room.size, "<="))
# [,1] [,2] [,3] [,4]
# [1,] 6 6 5 5
# [2,] 0 0 4 0
# [3,] 0 0 5 7
# [4,] 0 0 5 0
Now, you can approach this problem using integer programming, defining a variable x_pr that takes value 1 if program p is assigned to room r and 0 otherwise. You can code up the objective and constraints pretty easily using the lpSolve package in R:
# Convenience variables
nr <- nrow(obj.adj)
nc <- ncol(obj.adj)
# Model
library(lpSolve)
mod <- lp("max",
          as.vector(obj.adj),  # objective: value of assigning each program (row) to each room (column)
          t(1*sapply(1:nc, function(x) rep(1:nc == x, each=nr))),  # one constraint row per room
          rep("<=", nc),       # at most one program per room
          rep(1, nc),
          all.bin=TRUE)
matrix(mod$solution, nrow=nr)
# [,1] [,2] [,3] [,4]
# [1,] 1 1 0 0
# [2,] 0 0 0 0
# [3,] 0 0 0 1
# [4,] 0 0 1 0
Now we've assigned "Office" to rooms 101 and 201, "Studio/Classroom" to room 301, and "Reception/Greeter" to room 401.
It's worth noting that this particular problem could just as easily be solved by picking the program with the largest value in each column of obj.adj, so lpSolve is only warranted once you have more complicated constraints than the ones modeled above.
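For instance, the program-count constraints from the question (e.g., exactly 1 Reception/Greeter and at most 2 Offices) can be added to the same model. Here is a minimal sketch with illustrative caps (the caps are assumptions, not from the question's full data): since the variables are stored column-major (all programs for room 101, then room 201, ...), program p occupies positions p, p + nr, p + 2*nr, ....
room.con <- t(1*sapply(1:nc, function(x) rep(1:nc == x, each=nr)))   # one program per room
prog.con <- t(1*sapply(1:nr, function(p) rep(1:nr == p, times=nc)))  # room count per program
mod2 <- lp("max",
           as.vector(obj.adj),
           rbind(room.con, prog.con),
           c(rep("<=", nc), "<=", "<=", "==", "<="),
           c(rep(1, nc), 2, 4, 1, 4),  # illustrative caps: at most 2 Offices, exactly 1 Reception/Greeter
           all.bin=TRUE)
matrix(mod2$solution, nrow=nr)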

Related

Large mixed integer programming in R - possible to solve?

I would like to solve a large mixed integer programming problem, and I have tried with R, package lpSolveAPI. The problem is large: 410 variables, each of which can be either 0 or 1, and about 49,422 constraints. I have tried to let it run for 1.5 days, but once I try to stop it, it says that R needs to be terminated. The same happens if I let it run for a short time, say 15 minutes, and then try to stop it by clicking on the red button. Since this happens, I am not sure whether there is something wrong with my computer or whether such a problem is simply too large for a computer. When it runs, it uses maybe 20% of the CPU power and about 70% of memory. My computer is a 2022 Lenovo X1 Yoga with an i7 2.80 GHz processor and 16 GB of RAM.
The problem itself is constructed in the following way:
library(lpSolveAPI)
# A has length 410. No constraints to begin with.
lprec <- make.lp(0, 410)
set.objfn(lprec, round(A, 0))
lp.control(lprec, sense="max")
set.type(lprec, 1:410, "binary")
# Defining the constraints with a for loop. Will not go into details,
# but it adds 49422 constraints.
for (...) {
  ...
  add.constraint(lprec, MyConstraint, "<=", 1)
  ...
}
lprec
# This says: Model name: a linear program with 410 decision variables and 49422 constraints
solve(lprec)
The vector "MyConstraint" is different in every iteration, but it has length 410 where 408 elements are 0 and two elements are 1.
That is, I have 410 objects in total, and I want to choose a set of those objects (1 if an object is chosen and 0 otherwise) such that the objective function is maximized. However, some pairs of objects are not allowed, and so each of the 49,422 constraints specifies two objects that cannot be chosen at once: each constraint says that the sum of the pair cannot be above 1.
So, my question is whether there is any way to solve this. If not, how large can such a problem be and still be solvable?
Thank you!
EDIT: ---------------------------------------------
In the comments I was asked to provide an example, so here it is: a similar, but much smaller, problem. Suppose we have 7 different objects, and these can be allocated into 5 groups. Let us define the groups and the associated savings, denoted by A:
MyGroups <- c(1,0,0,0,1,0,0,
0,0,1,1,0,0,0,
0,0,1,0,0,1,0,
0,0,0,1,0,1,0,
0,0,1,1,0,1,0)
MyGroups <- matrix(MyGroups,nrow=5,ncol=7,byrow=TRUE)
rownames(MyGroups) <- paste0("Group", 1:5)
colnames(MyGroups) <- paste0("Object", 1:7)
A <- c(50, 30, 100, 100, 200)
That is, group 1 consists of Object 1 and Object 5 (denoted by the first row in the matrix MyGroups). Such a group gives a saving of 50. Objective: maximize the total saving by choosing the right groups. Problem: each object can only be part of one group. For example, if group 2 is implemented, then group 3 cannot be implemented, since both groups require object 3. Here we see that the optimal solution is to choose Group 1 and Group 5, which gives a total saving of 50+200=250. I want to be able to find this for a bigger problem. So, first I create a matrix of constraints that specifies which two groups cannot be implemented at the same time:
lprec2 <- make.lp(0, 5)
set.objfn(lprec2, A)
lp.control(lprec2, sense="max")
set.type(lprec2, 1:5, "binary")
# Defining the constraints
for (i in 1:(5-1)){
  for (j in (i+1):5) {
    if (max(colSums(MyGroups[c(i,j),])) > 1){
      # group i and group j cannot be together: add a constraint
      MyConstraint <- integer(5)
      MyConstraint[c(i,j)] <- 1
      add.constraint(lprec2, MyConstraint, "<=", 1)
    }
  }
}
lprec2
This gives the corresponding mixed integer problem (the printed model summary is omitted here). When I solve it, the solution is:
solve(lprec2)
get.objective(lprec2)
get.variables(lprec2)
Which gives 250 and (1 0 0 0 1) respectively.
In the original problem I have 410 possible groups, implying 410 decision variables. The number of constraints is 49,422, but in all rows there are exactly two 1s and the remaining are 0s.
If you could help me to solve such a problem, I would be happy :-). Thanks!
Here is the model formulated using ompr:
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)
library(magrittr)  # for %>%

MyGroups <- c(1,0,0,0,1,0,0,
              0,0,1,1,0,0,0,
              0,0,1,0,0,1,0,
              0,0,0,1,0,1,0,
              0,0,1,1,0,1,0)
MyGroups <- matrix(MyGroups, nrow=5, ncol=7, byrow=TRUE)
ngroups <- nrow(MyGroups)
nobjects <- ncol(MyGroups)
coeffs <- c(50, 30, 100, 100, 200)

model <- MIPModel() %>%
  add_variable(group[i], i=1:ngroups, type='binary') %>%
  # assign[i, j] exists only where group i contains object j
  add_variable(assign[i, j], i=1:ngroups, j=1:nobjects, type='binary', MyGroups[i, j] == 1) %>%
  set_objective(sum_over(coeffs[i] * group[i], i=1:ngroups), sense='max') %>%
  # each object may be used by at most one selected group
  add_constraint(sum_over(assign[i, j], i=1:ngroups, MyGroups[i, j] == 1) <= 1, j=1:nobjects) %>%
  # link assignment variables to group selection
  add_constraint(assign[i, j] == group[i], i=1:ngroups, j=1:nobjects, MyGroups[i, j] == 1) %>%
  # optional: limit the number of groups selected
  add_constraint(sum_over(group[i], i=1:ngroups) <= 2)

result <- solve_model(model, with_ROI("glpk", verbose = TRUE))
<SOLVER MSG> ----
GLPK Simplex Optimizer, v4.47
16 rows, 16 columns, 35 non-zeros
* 0: obj = 0.000000000e+000 infeas = 0.000e+000 (11)
* 12: obj = 2.500000000e+002 infeas = 0.000e+000 (3)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
16 rows, 16 columns, 35 non-zeros
16 integer variables, all of which are binary
Integer optimization begins...
+ 12: mip = not found yet <= +inf (1; 0)
+ 13: >>>>> 2.500000000e+002 <= 2.500000000e+002 0.0% (1; 0)
+ 13: mip = 2.500000000e+002 <= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
<!SOLVER MSG> ----
result
Status: success
Objective value: 250
ompr is a model-management wrapper around the ROI package. It uses an algebraic paradigm like GAMS or AMPL, but has less embedded logic to simplify the syntax. With ompr, you can also test the other solvers that ROI offers as plug-ins: http://roi.r-forge.r-project.org/
Some are free; others, like Mosek, CPLEX and Gurobi, are commercial products. I suggest running a large subset problem and checking the relative performance of the different solvers.
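As a minimal sketch of swapping in another back-end (assuming the corresponding ROI.plugin.* package is installed; "symphony" is one illustrative choice):
library(ROI.plugin.symphony)  # assumed installed
result_sym <- solve_model(model, with_ROI("symphony"))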
Also note that your toy problem is degenerate: Group (1, 3, 4) is also a solution. I added an additional constraint that can limit the number of groups selected. If your objective function coefficients are integer values, the formulation may have many degenerate solutions; a simple test is to add a small random epsilon to each of the coefficients to eliminate the degeneracy and see if that improves performance.
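A minimal sketch of that epsilon perturbation (the magnitude is an illustrative choice):
set.seed(1)
coeffs_eps <- coeffs + runif(length(coeffs), 0, 1e-4)  # jitter to break ties between optima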
Keying in on a couple of the OP's statements:
"In the original problem I have 410 possible groups, implying 410 decision variables. The number of constraints is 49,422, but in all rows there are exactly two 1s and the remaining are 0s."
and
"I have 32 objects."
It seems like this can be formulated as a one-sided matching problem with a utility matrix instead of a preference matrix, which can be solved using the matchingR package.
The problem is set up using a savings matrix where the row and column indices refer to objects and each cell (and its mirror across the main diagonal) represents the savings for a group composed of two objects (the row and column).
First a smaller example with 7 objects and 15 groups:
library(matchingR) # for the roommate function
library(Rfast) # for the rowSort function
set.seed(379327748)
n <- 7L # number of objects
m <- matrix(0L, n, n) # initialize the savings matrix
# specify the savings for the 15 groups
m[which(lower.tri(m))[sample(n*(n - 1L)/2L, 15)]] <- sample(1e3, 15, TRUE)
# make the savings matrix symmetric
m[upper.tri(m)] <- t(m)[upper.tri(m)]
# the savings matrix: each cell refers to the savings for the row/column pair
# it is symmetric: 1 paired with 2 has the same savings as 2 paired with 1
m
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 0 692 429 767 133 434 619
#> [2,] 692 0 0 997 146 801 0
#> [3,] 429 0 0 214 966 683 0
#> [4,] 767 997 214 0 835 0 0
#> [5,] 133 146 966 835 0 888 513
#> [6,] 434 801 683 0 888 0 0
#> [7,] 619 0 0 0 513 0 0
# use the roommate function to get the optimal pairings
groups <- unique(rowSort(cbind(1:n, roommate(utils = m))))
# remove objects with no pair in the optimal solution (NAs happen only with an
# odd number of objects)
groups <- groups[!is.na(groups[,2]),]
groups <- groups[m[groups] > 0,]
# show the solution
groups
#> [,1] [,2]
#> [1,] 1 7
#> [2,] 2 4
#> [3,] 3 5
c(savings = sum(m[groups]))
#> savings
#> 2582
Now an example with 32 objects and 410 groups. The solution is provided almost instantly.
n <- 32L
m <- matrix(0L, n, n)
m[which(lower.tri(m))[sample(n*(n - 1L)/2L, 410)]] <- sample(1e3, 410, TRUE)
m[upper.tri(m)] <- t(m)[upper.tri(m)]
system.time(groups <- unique(rowSort(cbind(1:n, roommate(utils = m)))))
#> user system elapsed
#> 0 0 0
groups <- groups[m[groups] > 0,]
groups
#> [,1] [,2]
#> [1,] 1 15
#> [2,] 2 18
#> [3,] 3 32
#> [4,] 4 19
#> [5,] 5 30
#> [6,] 6 9
#> [7,] 7 12
#> [8,] 8 14
#> [9,] 10 29
#> [10,] 11 24
#> [11,] 13 16
#> [12,] 17 20
#> [13,] 21 27
#> [14,] 22 31
#> [15,] 23 26
#> [16,] 25 28
c(savings = sum(m[groups]))
#> savings
#> 14369
# check that each object is used only once
max(tabulate(groups, 32L))
#> [1] 1

Vectorizing a for loop that changes columns of a matrix

Say I have a vector of ages of 100 trees. Then I age those trees up for 5, 10, 15, and 20 years into the future to create a matrix of tree ages for this year and four 5-year planning periods in the future.
But then, I decide to cut some of those trees (only 10 per planning period), documented in a matrix of T/F values where T is harvested and F is not (trees can't be harvested twice).
age.vec <- sample(x = 1:150, size = 100, replace = T) # create our trees
age.mat <- cbind(age.vec, age.vec+5, age.vec + 10, age.vec + 15, age.vec + 20) # grow them up
x.mat <- matrix(data = F, nrow = 100, ncol = 5) # create the empty harvest matrix
x.mat[cbind(sample(1:100, size = 50), rep(1:5, each = 10))] <- T # 10 trees/year harvested
So then, the ages of trees that are harvested become zero in that year:
age.mat[x.mat] <- 0
I then would like to age the harvested trees up again for the following periods. E.g. if a tree were harvested in the first planning period, in the second planning period (5 years later), I want the age of the tree to be 5, then in the third planning period (10 years later), I want the age of the tree to be 10. I have successfully implemented this in the following for loop:
for (i in 2:5){ # we don't need to calculate over the first year
  age.mat[,i] <- age.mat[,i-1] + 5L # add 5 to previous year
  age.mat[x.mat[,i],i] <- 0L # reset age of harvested trees to zero
}
This works; however, it is clunky and slow. Is there a way to implement this faster (i.e., without the for loop)? It is also implemented within a function, which means that using apply actually slows things down, so it needs to be vectorized directly. This is something I'm iterating over thousands of times, so speed is of the essence!
Thank you!
An alternative to the t(apply(...)) in @Jon Spring's answer is matrixStats::rowCumsums.
library(matrixStats)

n <- 1e4L
n10 <- n/10L
age.mat <- outer(sample(150, n, TRUE), seq(0, 20, 5), "+")
x.mat <- matrix(FALSE, n, 5) # create the empty harvest matrix
# sample harvests so that no tree is harvested twice
x.mat[matrix(c(sample(n, n/2L), sample(n10:(6L*n10 - 1L)) %/% n10), n/2L)] <- TRUE

f1 <- function(age, x) {
  age[x[,1],] <- 0
  for (i in 2:5){ # we don't need to calculate over the first year
    age[,i] <- age[,i - 1] + 5L # add 5 to previous year
    age[x[,i], i] <- 0L # reset age of harvested trees to zero
  }
  age
}

f2 <- function(age, x) {
  age - rowCumsums(x*age)
}

microbenchmark::microbenchmark(f1 = f1(age.mat, x.mat),
                               f2 = f2(age.mat, x.mat),
                               check = "equal")
#> Unit: microseconds
#>  expr   min    lq     mean median     uq     max neval
#>    f1 294.4 530.2 1023.450  566.6 629.35 33222.8   100
#>    f2 135.2 263.6  334.622  284.2 307.15  4343.6   100
Here's an approach relying on the fact that harvesting a tree doesn't stop the passage of time, it just resets the clock. So we can think of a harvest as subtracting the harvest age from all future ages.
x.die <- x.mat * age.mat
x.dif <- t(apply(x.die, 1, cumsum))
age.mat2 <- age.mat - x.dif
In x.die, multiplying the harvests by the ages gives the age at each harvest. The next line calculates the cumulative sum of these across each row, and finally we subtract those from the original ages.
I assume your "trees can't be harvested twice" means we won't ever see two TRUEs in one row of x.mat? My code won't work right if there is more than one harvest per tree location.
This looks to be about 12x faster, based on testing with rbenchmark.
I found a way to do it! I implemented the idea of going backwards from @Jon Spring, where I created a matrix with the age of the stand at the harvested year filled in for the harvested year and all subsequent years, then subtracted that from my pre-made aged-up matrix. I built a function similar to what fill from tidyr or na.locf from zoo do (because they were too slow).
First I used arrayInd to determine the positions in the matrix of trees that were changed. I then used that to build a second index matrix that repeats each harvest row once for every period from the harvest period through the last period, paired with a sequence vector of the same length running from the harvest period to the last period.
nper <- ncol(x.mat) # number of planning periods (5 here)
x.ind <- arrayInd(which(x.mat), dim(x.mat)) # row/column position of each harvest
x.new.ind <- cbind(rep(x.ind[,1], times = nper - x.ind[,2] + 1),
                   sequence(nvec = nper - x.ind[,2] + 1, from = x.ind[,2]))
For example, if there was a tree harvested at position [4, 2], meaning the fourth tree was harvested in the second period, and we had 5 periods total, it would create a matrix:
[,1] [,2]
[1,] 4 2
[2,] 4 3
[3,] 4 4
[4,] 4 5
Then I made a vector with the ages of the trees that were harvested in the correct positions, and zeros in the rest of the positions (e.g. for our example, if the tree harvested was 100 years old, we would have a vector of 0 0 0 100 0 (if we had 5 trees)).
ages.vec <- vector(mode = "integer", length = nrow(age.mat))
ages.vec[x.ind[,1]] <- age.mat[x.ind]
I then multiplied this vector by a logical matrix with "T" at the row, column positions in the matrix above.
Continuing with the above example, we get:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 100 100 100 100
[5,] 0 0 0 0 0
I then subtracted it from our current (already aged-up) ages matrix. So tree four was 95 100 105 110 115 and now it is 95 0 5 10 15.
new.ages.mat <- age.mat - replace(x.mat, x.new.ind, TRUE) * ages.vec
Though this might not be the most elegant solution, using microbenchmark it is 90x faster than our for loop, and 3x faster than the lovely apply function that John created. I would put in the microbenchmark calls and results, but this post is long enough already! I know there's a better way to create ages.vec and incorporate it; I'm going to continue working on that and will update this answer with my results!
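For reference, the omitted benchmark call could look like this (the wrapper names f.loop, f.apply and f.index are hypothetical stand-ins for the three approaches above):
microbenchmark::microbenchmark(
  loop  = f.loop(age.mat, x.mat),   # original for loop
  apply = f.apply(age.mat, x.mat),  # t(apply(..., cumsum)) approach
  index = f.index(age.mat, x.mat),  # arrayInd/replace approach above
  check = "equal"
)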
This approach builds on the use of which with arr.ind=TRUE to create a two-column matrix that encodes the starting locations (first column) and times (second column) for new tree planting. It does violate the functional programming paradigm by using <<- to assign new values to age.mat "in place".
fiveseq <- seq(0, 20, by=5) # this way one only needs to call `seq` once
apply(which(x.mat, arr.ind=TRUE), 1,
      function(r) { age.mat[r[1], r[2]:5] <<- fiveseq[1:(6 - r[2])] })
In summary, it locates the new locations and intervals and replaces the rest of each such row with the right number of items from the sequence {0, 5, 10, 15, 20}.
(I would be interested in seeing how this compares with the benchmarking framework that you have already established.)
You can use apply to work on each row, then use some logic within the function to adjust the values.
This should be about 4 times faster than the loop:
age.mat |>
  apply(1, \(x) {
    if (any(x == 0 & (which(x == 0) != length(x)))) {
      x[which(x == 0):length(x)] <- (0:(length(x) - which(x == 0))) * 5
      x
    } else x
  }) |> t()
[,1] [,2] [,3] [,4] [,5]
[1,] 101 0 5 10 15
[2,] 55 60 65 70 75
[3,] 23 28 33 0 5
[4,] 0 5 10 15 20
[5,] 23 28 33 0 5
[6,] 84 0 5 10 15
[7,] 52 57 62 0 5
[8,] 26 31 36 41 0
[9,] 114 119 124 129 0
[10,] 33 38 43 48 53
[11,] 144 149 154 159 164
[12,] 19 24 29 34 39
[13,] 43 48 53 58 63
[14,] 69 74 79 84 89
[15,] 98 103 108 113 118
[16,] 110 115 120 125 130
[17,] 8 13 18 23 28
[18,] 16 21 26 31 36
[19,] 1 6 11 16 21
[20,] 60 65 0 5 10

Probability of account win/loss using Bayesian Statistics

I am trying to estimate the probability of winning or losing an account, and I'd like to do this using Bayesian Methods. I'm not really that familiar with these methods, but I think I understand the general idea.
I know some information about losses and wins. Wins are usually characterized by some combination of activities; losses are usually characterized by a different combination of activities. I'd like to be able to get some posterior probability of whether or not a new observation will be won or lost, based on the current number of activities associated with that account.
Here is an example of my data: (This is just a sample for simplicity)
Email  Call  Callback  Outcome
   14     9         2        1
    3     2         4        0
   16    14         2        0
   15     1         3        1
    5     2         2        0
    1     1         0        0
   10     3         5        0
    2     0         1        0
   17     8         4        1
    3    15         2        0
   17     1         3        0
   10     7         5        0
   10     2         3        0
    8     0         0        1
   14    10         3        0
    1     9         3        1
    5    10         3        1
   13     5         1        0
    9     4         4        0
So from here I know that 30% of the observations have an outcome of 1 (win) and 70% have an outcome of 0 (loss). Let's say that I want to use the other columns to get a probability of win/loss for a new observation which may have a small number of events (emails, calls, and callbacks) associated with it.
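For reference, those base rates can be read straight off the data (assuming the data frame is called d; the name is illustrative):
prop.table(table(d$Outcome))  # empirical win/loss proportions to use as priors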
Now let's say that I want to use the counts/proportions of the different events as priors for a new observation. This is where I start getting tripped up. My thinking is to create one Dirichlet distribution for wins and one for losses, so two separate distributions, using the counts/proportions of events for each outcome as the priors. I'm not sure how to do this in R. I think my course of action would be to estimate a Dirichlet distribution (since I have 3 variables) for each outcome using maximum likelihood. I've been trying to use the dirichlet.simul and dirichlet.mle functions from the sirt package in R. I'm not sure if I need to simulate one first?
Another issue is that once I have this distribution, it's unclear to me how to get a posterior distribution for a new observation. I've read several papers and can't seem to find a straightforward process for doing this. (Or maybe there are some holes in my understanding.) Any pushes in the right direction would be greatly appreciated.
This is the code I've tried so far:
library(sirt)

### FOR WON ACCOUNTS
set.seed(789)
N <- 6
probs <- c(0.535714286, 0.330357143, 0.133928571)
alpha <- matrix(probs, nrow=N, ncol=length(probs), byrow=TRUE)
x <- dirichlet.simul(alpha)
dirichlet.mle(x)

$alpha
[1] 0.3385607 0.2617939 0.1972898

$alpha0
[1] 0.7976444

$xsi
[1] 0.4244507 0.3282088 0.2473405

### FOR LOST ACCOUNTS
set.seed(789)
N2 <- 14
probs2 <- c(0.528037383, 0.308411215, 0.163551402)
alpha2 <- matrix(probs2, nrow=N2, ncol=length(probs2), byrow=TRUE)
x2 <- dirichlet.simul(alpha2)
dirichlet.mle(x2)

$alpha
[1] 0.3388486 0.2488771 0.2358043

$alpha0
[1] 0.8235301

$xsi
[1] 0.4114587 0.3022077 0.2863336
Not sure if this is a correct approach or how to get posteriors from here. I realize all the outputs look similar across won/lost accounts. I just used some simulated data to represent what I'm working with.
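One possible push in the right direction (a sketch under assumptions, not a definitive recipe): treat the two fitted Dirichlet distributions as class-conditional densities over the activity proportions and combine them with the 30%/70% base rates via Bayes' rule. The sketch below assumes gtools for the Dirichlet density and that fit.win and fit.lose hold the two dirichlet.mle results; note ddirichlet is only defined for strictly positive proportions, so zero counts would need smoothing.
library(gtools) # for ddirichlet()
# Posterior P(win | activity counts) under Dirichlet class-conditionals
posterior.win <- function(counts, alpha.win, alpha.lose,
                          prior.win = 0.3, prior.lose = 0.7) {
  p <- counts / sum(counts)             # proportions of email/call/callback
  lik.win  <- ddirichlet(p, alpha.win)  # density under the "won" fit
  lik.lose <- ddirichlet(p, alpha.lose) # density under the "lost" fit
  lik.win * prior.win / (lik.win * prior.win + lik.lose * prior.lose)
}
# e.g., a new account with 5 emails, 3 calls, 1 callback (hypothetical):
# posterior.win(c(5, 3, 1), fit.win$alpha, fit.lose$alpha)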

Creating large matrices in R in reasonable time

I am working on a movie recommender that predicts a user's rating for an unseen movie. Most of the work is done, and I have created a 7000x3000 matrix userRatingsNew containing 7000 users and their ratings for 3000 movies, with all missing values replaced by the predicted rating.
I was provided two other files, mapping and test, and used read.csv() to load them into matrices of the following format.
mapping is an 8,400,000x3 matrix that contains id, user, movie, where id is the transaction id associated with a user's rating of a given movie.
test is an 8,400,000x2 matrix that contains id, rating, where rating is the user's rating for the movie associated with id. The values in the rating column are empty, and I need to fill them in using the predicted values that I have already calculated.
Here is my code
writeResult <- function(userRatingsNew, mapping, test, writeToFile = FALSE){
  start <- Sys.time()
  result <- test
  entries <- nrow(test)
  for (i in 1:entries){
    result[i,2] <- userRatingsNew[mapping[i,2], mapping[i,3]]
  }
  if (writeToFile)
    write.csv(result, "result.csv", row.names=FALSE)
  print(Sys.time()-start)
  return(result)
}
My problem is that for i=1:100 it takes ~7 seconds, so processing all 8.4 million entries would take ~163 hours. I tried using doMC and implementing parallel processing, but I ran into the problem that my computer ran out of memory. What exactly can I do to speed this process up?
You can index a matrix with another matrix, as in:
M <- matrix(1:25,nc=5,nr=5)
M
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 6 11 16 21
# [2,] 2 7 12 17 22
# [3,] 3 8 13 18 23
# [4,] 4 9 14 19 24
# [5,] 5 10 15 20 25
m <- cbind(1:5,5:1)
m
# [,1] [,2]
# [1,] 1 5
# [2,] 2 4
# [3,] 3 3
# [4,] 4 2
# [5,] 5 1
M[m]
# [1] 21 17 13 9 5
So try
result[,2] <- userRatingsNew[mapping[,2:3]]
You should not need a loop.
A thought:
Instead of attaching the 3000-sized movie dimension directly to the 7000-sized user dimension, you can attach to each user an array that specifies, as a series of 2-d datapoints, the movie id/number/place in the array and the rating. Presumably most users will not rate all 3000 films. Say they rate 20 movies on average: each of the 20 entries refers to the film by its location in an array of movie names, so you only need (7000) x (20x2 + 20) things going on, where 20x2 is the 20 ratings plus the reference to the film, and the other 20 is retrieving the film name. You can compile all reports first using array locations and attach the names afterwards from the array of film names.
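A minimal sketch of that sparse layout (userRatings and nusers are illustrative names, with NA marking unrated movies):
nusers <- nrow(userRatings)
ratings.by.user <- lapply(1:nusers, function(u) {
  rated <- which(!is.na(userRatings[u, ])) # indices of movies user u rated
  cbind(movie = rated, rating = userRatings[u, rated])
})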

Filling a matrix with a circular pattern

I want to write a function that fills an m by m matrix, where m is odd, as follows:
1) it starts from the middle cell of the matrix (for example, for a 3 by 3 matrix A, the middle cell is A[2,2]) and puts the number 1 there
2) it goes one cell to the right, adds 1 to the previous cell, and puts the result (2) there
3) it goes down and puts 3, then left 4, left 5, up 6, up 7, ...
For example, the resulting matrix could look like this:
7 8 9
6 1 2
5 4 3
Could somebody help me implement this?
max_x <- 5
len <- max_x^2
middle <- ceiling(max_x/2)
A <- matrix(NA, max_x, max_x)
increments <- Reduce(
  f = function(lhs, rhs) c(lhs, (-1)^(rhs/2 + 1)*rep(1, rhs)),
  x = 2*(1:max_x),
  init = 0
)[1:len]
idx_x <- Reduce(
  f = function(lhs, rhs) c(lhs, rep(c(TRUE, FALSE), each = rhs)),
  1:max_x,
  init = FALSE
)[1:len]
increments_x <- increments
increments_y <- increments
increments_x[!idx_x] <- 0
increments_y[idx_x] <- 0
A[(middle + cumsum(increments_x) - 1)*max_x + middle + cumsum(increments_y)] <- 1:(max_x^2)
Gives
#> A
# [,1] [,2] [,3] [,4] [,5]
#[1,] 21 22 23 24 25
#[2,] 20 7 8 9 10
#[3,] 19 6 1 2 11
#[4,] 18 5 4 3 12
#[5,] 17 16 15 14 13
Explanation:
The vector increments denotes the steps along the path of the increasing numbers: 0/+1/-1 for unchanged/increasing/decreasing row and column indices. Importantly, these numbers do not distinguish between steps along columns and steps along rows; that is managed by the vector idx_x, which masks out increments that are along a row (TRUE) or along a column (FALSE).
The last line takes R's indexing logic into account (the linear matrix index increases down the columns).
Edit:
As per request of the OP, here is some more information about how the increments vector is calculated.
You always walk two consecutive straight lines of equal length (one row-wise, one column-wise). The length increases by 1 after each such pair of walks; this corresponds to the x=2*(1:(max_x)) argument together with rep(1,rhs). The first two consecutive walks run in the increasing column/row direction, the next two in the negative direction, and so on (alternating); this is accounted for by (-1)^(rhs/2+1).
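To make that concrete, here are the first entries of the two helper vectors for max_x=5 (these values follow directly from the code above):
head(increments, 13)
# [1]  0  1  1 -1 -1 -1 -1  1  1  1  1  1  1
head(idx_x, 13)
# [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE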
