Say I have a vector of ages of 100 trees. Then I age those trees up for 5, 10, 15, and 20 years into the future to create a matrix of tree ages for this year and four 5-year planning periods in the future.
But then, I decide to cut some of those trees (only 10 per planning period), documented in a matrix of T/F values where T is harvested and F is not (trees can't be harvested twice).
age.vec <- sample(x = 1:150, size = 100, replace = T) # create our trees
age.mat <- cbind(age.vec, age.vec+5, age.vec + 10, age.vec + 15, age.vec + 20) # grow them up
x.mat <- matrix(data = F, nrow = 100, ncol = 5) # create the empty harvest matrix
x.mat[cbind(sample(1:100, size = 50), rep(1:5, each = 10))] <- T # 10 trees/year harvested
So then, the ages of trees that are harvested become zero in that year:
age.mat[x.mat] <- 0
I then would like to age the harvested trees up again for the following periods. E.g. if a tree were harvested in the first planning period, in the second planning period (5 years later), I want the age of the tree to be 5, then in the third planning period (10 years later), I want the age of the tree to be 10. I have successfully implemented this in the following for loop:
for (i in 2:5){ # we don't need to calculate over the first year
age.mat[,i]<-age.mat[,i-1]+5L # add 5 to previous year
age.mat[x.mat[,i],i] <- 0L # reset age of harvested trees to zero
}
This works, however, it is clunky and slow. Is there a way to implement this faster (i.e. without the for loop)? It also is implemented within a function, which means that using "apply" actually slows things down, so it needs to be vectorized directly. This is something I'm iterating over thousands of times so speed is of the essence!
Thank you!
An alternative to the t(apply(...)) in Jon Spring's answer is matrixStats::rowCumsums.
library(matrixStats)
n <- 1e4L
n10 <- n/10L
age.mat <- outer(sample(150, n, TRUE), seq(0, 20, 5), "+")
x.mat <- matrix(FALSE, n, 5) # create the empty harvest matrix
# sample harvests so that no tree is harvested twice
x.mat[matrix(c(sample(n, n/2L), sample(n10:(6L*n10 - 1L)) %/% n10), n/2L)] <- TRUE
f1 <- function(age, x) {
age[x[,1],] <- 0
for (i in 2:5){ # we don't need to calculate over the first year
age[,i] <- age[,i - 1] + 5L # add 5 to previous year
age[x[,i], i] <- 0L # reset age of harvested trees to zero
}
age
}
f2 <- function(age, x) {
age - rowCumsums(x*age)
}
microbenchmark::microbenchmark(f1 = f1(age.mat, x.mat),
f2 = f2(age.mat, x.mat),
check = "equal")
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> f1 294.4 530.2 1023.450 566.6 629.35 33222.8 100
#> f2 135.2 263.6 334.622 284.2 307.15 4343.6 100
This looks to be about 12x faster, based on testing with rbenchmark.
Here's an approach relying on the fact that harvesting a tree doesn't stop the passage of time, it just resets the clock. So we can think of a harvest as subtracting the harvest age from all future ages.
x.die <- x.mat * age.mat
x.dif <- t(apply(x.die, 1, cumsum))
age.mat2 <- age.mat - x.dif
In x.die, multiplying the harvests by the ages gives the age at each harvest (and zero everywhere else). The next line calculates the cumulative sum of these across each row, and finally we subtract those from the original ages.
I assume your "trees can't be harvested twice" means we won't ever see two TRUEs in one row of x.mat? My code won't work right if there were more than one harvest per tree location.
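To make the subtraction idea concrete, here is a minimal sketch on a made-up three-tree example. The starting ages and harvest positions are assumptions chosen for illustration; the check compares the cumulative-sum result against the loop from the question.

```r
# Toy data (assumed): 3 trees, 5 periods, ages grow by 5 per period
age.mat <- outer(c(95, 40, 12), seq(0, 20, 5), "+")
x.mat <- matrix(FALSE, nrow = 3, ncol = 5)
x.mat[1, 2] <- TRUE   # tree 1 harvested in period 2 (at age 100)
x.mat[3, 5] <- TRUE   # tree 3 harvested in period 5 (at age 32)

# Reference result: the loop from the question
loop.res <- age.mat
loop.res[x.mat[, 1], 1] <- 0
for (i in 2:5) {
  loop.res[, i] <- loop.res[, i - 1] + 5   # add 5 to previous period
  loop.res[x.mat[, i], i] <- 0             # reset harvested trees
}

# Cumulative-sum approach: subtract the harvest age from all later periods
x.die <- x.mat * age.mat                   # age at harvest, 0 elsewhere
cum.res <- age.mat - t(apply(x.die, 1, cumsum))

stopifnot(isTRUE(all.equal(loop.res, cum.res)))
cum.res[1, ]   # 95 0 5 10 15
```

Tree 2 is never harvested, so its row is left untouched.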
I found a way to do it! I implemented the idea of going backwards from Jon Spring's answer: I created a matrix with the age of the stand at the harvest year filled in for the harvest year and all subsequent years, then subtracted that from my pre-made aged-up matrix. I built a function similar to "fill" from tidyr or "na.locf" from zoo (both of which were too slow).
First I used arrayInd to determine the positions in the matrix of trees that were changed. I then used that to make another matrix that combined a repeat of each index row a number of times equal to the number of periods minus the period the tree was harvested in plus one, and a sequence vector of the same length that sequences from the period of the index number to the number of periods.
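nper <- ncol(x.mat) # number of periods (not defined in the original post; assumed to be the column count)
x.ind <- arrayInd(which(x.mat), dim(x.mat)) # row and column index of each harvest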
x.new.ind <- cbind(rep(x.ind[,1], times = nper-x.ind[,2]+1), sequence(nvec = nper-x.ind[,2]+1, from = x.ind[,2]))
For example, if there was a tree harvested at position [4, 2], meaning the fourth tree was harvested in the second period, and we had 5 periods total, it would create a matrix:
[,1] [,2]
[1,] 4 2
[2,] 4 3
[3,] 4 4
[4,] 4 5
Then I made a vector with the ages of the trees that were harvested in the correct positions, and zeros in the rest of the positions (e.g. for our example, if the tree harvested was 100 years old, we would have a vector of 0 0 0 100 0 (if we had 5 trees)).
ages.vec <- vector(mode = "integer", length = nrow(age.mat))
ages.vec[x.ind[,1]]<- age.mat[x.ind]
I then multiplied this vector by a logical matrix with "T" at the row, column positions in the matrix above.
Continuing with the above example, we get:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 100 100 100 100
[5,] 0 0 0 0 0
I then subtracted it from our current (already aged-up) ages matrix. So tree four was 95 100 105 110 115 and now it is 95 0 5 10 15.
new.ages.mat<- age.mat - replace(x.mat, x.new.ind, TRUE)*ages.vec
Though this might not be the most elegant solution, according to microbenchmark it is 90x faster than our for loop and 3x faster than the lovely apply function that Jon created. I would put in the microbenchmark calls and results, but this post is long enough already! I know there's a better way to create the ages.vec and incorporate it; I am going to continue working on that and will update this answer with my results!
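For readers who want to run the steps above end to end, here is a small sketch on toy data. The five starting ages and the two harvests are assumptions for illustration, and nper is taken to be the number of columns; note that sequence() only accepts the from argument in R 4.0.0 or later.

```r
# Toy data (assumed): 5 trees, 5 periods, 5-year steps
nper <- 5L
age.mat <- outer(c(20L, 40L, 60L, 95L, 10L), seq(0L, 20L, 5L), "+")
x.mat <- matrix(FALSE, nrow = 5, ncol = nper)
x.mat[4, 2] <- TRUE   # tree 4 harvested in period 2 (at age 100)
x.mat[2, 5] <- TRUE   # tree 2 harvested in period 5 (at age 60)

# Row/column position of every harvest
x.ind <- arrayInd(which(x.mat), dim(x.mat))

# Positions from each harvest period to the last period, per harvested tree
n.fill <- nper - x.ind[, 2] + 1L
x.new.ind <- cbind(rep(x.ind[, 1], times = n.fill),
                   sequence(nvec = n.fill, from = x.ind[, 2]))

# Age at harvest for each tree, 0 for trees that are never harvested
ages.vec <- integer(nrow(age.mat))
ages.vec[x.ind[, 1]] <- age.mat[x.ind]

# Subtract the harvest age from the harvest period onwards
new.ages.mat <- age.mat - replace(x.mat, x.new.ind, TRUE) * ages.vec
new.ages.mat[4, ]   # 95 0 5 10 15
new.ages.mat[2, ]   # 40 45 50 55 0
```

The multiplication by ages.vec relies on column-wise recycling, so element i of the vector scales row i of the logical matrix.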
This approach builds on which() with arr.ind=TRUE to create a two-column matrix that encodes the locations (first column) and times (second column) of new tree planting. It does violate the functional programming paradigm by using <<- to assign new values to age.mat "in place".
fiveseq <- seq(0,20, by=5) # this way one only needs to call `seq` once
apply(which(x.mat, arr.ind=TRUE) , 1,
function(r) {age.mat[ r[1], r[2]:5] <<- fiveseq[ 1:(6-r[2])] } )
In summary, it locates the new locations and intervals and replaces the rest of that row with the right number of items from the sequence {0, 5, 10, 15, 20}
(I would be interested in seeing how this compares with the benchmarking framework that you have already established.)
You can use apply to work on each vector rowwise, then use some logic within the function to adjust the values.
Should be about 4 times faster
age.mat |>
apply(1, \(x) {
if(any(x == 0 & (which(x == 0) != length(x)))) {
x[which(x == 0):length(x)] <- (0:(length(x) - which(x == 0))) * 5
x
} else x
}) |> t()
[,1] [,2] [,3] [,4] [,5]
[1,] 101 0 5 10 15
[2,] 55 60 65 70 75
[3,] 23 28 33 0 5
[4,] 0 5 10 15 20
[5,] 23 28 33 0 5
[6,] 84 0 5 10 15
[7,] 52 57 62 0 5
[8,] 26 31 36 41 0
[9,] 114 119 124 129 0
[10,] 33 38 43 48 53
[11,] 144 149 154 159 164
[12,] 19 24 29 34 39
[13,] 43 48 53 58 63
[14,] 69 74 79 84 89
[15,] 98 103 108 113 118
[16,] 110 115 120 125 130
[17,] 8 13 18 23 28
[18,] 16 21 26 31 36
[19,] 1 6 11 16 21
[20,] 60 65 0 5 10
I would like to solve a large mixed integer programming problem, and I have tried with R, package lpSolveAPI. The problem is large: 410 variables, each of which can be either 0 or 1, and about 49422 constraints. I have tried to let it run for 1.5 days, but once I try to stop it, it says that R needs to be terminated. The same happens if I let it run for a short time, say 15 minutes, and then try to stop it by clicking on the red button. Since this happens, I am not sure whether there is something wrong with my computer or whether such a problem is way too large for a computer. When it runs, it uses maybe 20% of the CPU power and about 70% of memory. My computer is a 2022 Lenovo X1 Yoga with an i7 2.80GHz processor and 16 GB of RAM.
The problem itself is constructed in the following way:
library(lpSolveAPI)
#A has length of 410. No constraints to begin with
lprec <- make.lp(0, 410)
set.objfn(lprec, round(A,0))
lp.control(lprec, sense="max")
set.type(lprec, seq_along(A), "binary") # column indices 1..410 (1:A does not work because A is a vector)
#Defining the constraints with a for loop. Will not go into details, but it adds 49422 constraints
for (){
...
add.constraint(lprec, MyConstraint, "<=", 1)
...
}
lprec
#This says: Model name: a linear program with 410 decision variables and 49422 constraints
solve(lprec)
The vector "MyConstraint" is different in every iteration, but it has length 410 where 408 elements are 0 and two elements are 1.
That is, I have 410 objects in total, and I want to choose a set of those objects (1 if an object is chosen and 0 otherwise) such that the objective function is maximized. However, some pairs of objects are not allowed, and so each of the 49422 constraints specifies which two objects cannot be chosen at once: each constraint says that their sum cannot be above 1.
So, my question is if there is any way to solve this? If not, how large can such problem be in order to be solvable?
Thank you!
EDIT: ---------------------------------------------
In the comments I was asked to provide an example, so here it is: a similar, but much smaller problem. Suppose we have 7 different objects, and these can be allocated into 5 groups. Let us define the groups and the associated savings, denoted by A:
MyGroups <- c(1,0,0,0,1,0,0,
0,0,1,1,0,0,0,
0,0,1,0,0,1,0,
0,0,0,1,0,1,0,
0,0,1,1,0,1,0)
MyGroups <- matrix(MyGroups,nrow=5,ncol=7,byrow=TRUE)
rownames(MyGroups) <- paste0("Group", 1:5)
colnames(MyGroups) <- paste0("Object", 1:7)
A=c(50,30,100,100,200)
That is, group 1 consists of Object 1 and Object 5 (denoted by the first row in the matrix MyGroups). Such a group will give a saving of 50. Objective: to maximize the total saving by choosing the right groups. Problem: each object can only be a part of one group. For example, if group 2 is implemented, then group 3 cannot be implemented, since both groups require object 3. Here we see that the optimal solution is to choose Group 1 and Group 5, which will give a total saving of 50+200=250. I want to be able to find this for a bigger problem. So, first I can create a matrix of constraints that specifies which 2 groups cannot be implemented at the same time.
lprec2 <- make.lp(0, 5)
set.objfn(lprec2, A)
lp.control(lprec2, sense="max")
set.type(lprec2,c(1:5),"binary")
#Defining the constraints
for (i in 1:(5-1)){
for (j in (i+1):5) {
if(max(colSums(MyGroups[c(i,j),]))>1){
#group i and group j cannot be together. Add constraint
MyConstraint=integer(5)
MyConstraint[c(i,j)]=1
add.constraint(lprec2, MyConstraint, "<=", 1)
}
}
}
lprec2
This gives the following mixed integer problem:
When I solve it, then the solution is:
solve(lprec2)
get.objective(lprec2)
get.variables(lprec2)
Which gives 250 and (1 0 0 0 1) respectively.
In the original problem I have 410 possible groups, implying 410 decision variables. The number of constraints is 49422, but in all rows there are exactly two 1 and the remaining are 0.
If you could help me to solve such a problem, I would be happy :-). Thanks!
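As a sanity check on the toy example, the sketch below enumerates all 2^5 group selections in base R and keeps the best feasible one. This brute force is only an illustration for tiny instances (it is hopeless for 410 groups), and the variable names sel, uses, and feasible are introduced here, not taken from the post. In graph terms, choosing groups under pairwise conflicts is a maximum-weight independent set problem on the conflict graph.

```r
MyGroups <- matrix(c(1,0,0,0,1,0,0,
                     0,0,1,1,0,0,0,
                     0,0,1,0,0,1,0,
                     0,0,0,1,0,1,0,
                     0,0,1,1,0,1,0), nrow = 5, ncol = 7, byrow = TRUE)
A <- c(50, 30, 100, 100, 200)

# Every 0/1 selection of the 5 groups, one selection per row
sel <- as.matrix(expand.grid(rep(list(0:1), 5)))

# How many chosen groups use each object; feasible if no object is used twice
uses <- sel %*% MyGroups
feasible <- apply(uses <= 1, 1, all)

# Total saving of each selection; infeasible selections are excluded
savings <- as.vector(sel %*% A)
best <- which.max(ifelse(feasible, savings, -Inf))

unname(sel[best, ])   # 1 0 0 0 1
savings[best]         # 250
```

This agrees with the lpSolveAPI result reported above.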
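Here is the model formulated using ompr:
library(magrittr) # provides the %>% pipe
library(ompr)
library(ompr.roi) # provides with_ROI()
library(ROI.plugin.glpk) # GLPK solver plug-in for ROI
MyGroups <- c(1,0,0,0,1,0,0,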
0,0,1,1,0,0,0,
0,0,1,0,0,1,0,
0,0,0,1,0,1,0,
0,0,1,1,0,1,0)
MyGroups <- matrix(MyGroups,nrow=5,ncol=7,byrow=TRUE)
ngroups <- nrow(MyGroups)
nobjects <- ncol(MyGroups)
coeffs <- c(50, 30, 100, 100, 200)
model <- MIPModel() %>%
add_variable(group[i], i=1:ngroups, type = 'binary') %>%
add_variable(assign[i, j], i=1:ngroups, j=1:nobjects, type = 'binary', MyGroups[i, j] == 1) %>%
set_objective(sum_over(coeffs[i] * group[i], i=1:ngroups), sense = 'max') %>%
add_constraint(sum_over(assign[i, j], i=1:ngroups, MyGroups[i, j] == 1) <= 1, j=1:nobjects) %>%
add_constraint(assign[i, j] == group[i], i=1:ngroups, j=1:nobjects, MyGroups[i, j] == 1) %>%
add_constraint(sum_over(group[i], i=1:ngroups) <= 2)
result <- solve_model(model, with_ROI("glpk", verbose = TRUE))
result
<SOLVER MSG> ----
GLPK Simplex Optimizer, v4.47
16 rows, 16 columns, 35 non-zeros
* 0: obj = 0.000000000e+000 infeas = 0.000e+000 (11)
* 12: obj = 2.500000000e+002 infeas = 0.000e+000 (3)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
16 rows, 16 columns, 35 non-zeros
16 integer variables, all of which are binary
Integer optimization begins...
+ 12: mip = not found yet <= +inf (1; 0)
+ 13: >>>>> 2.500000000e+002 <= 2.500000000e+002 0.0% (1; 0)
+ 13: mip = 2.500000000e+002 <= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
<!SOLVER MSG> ----
result
Status: success
Objective value: 250
ompr is a model management wrapper around the ROI package. It uses an algebraic paradigm like GAMS or AMPL, but has less embedded logic to simplify the syntax. With ompr, you can also test the other solvers that ROI offers as plug-ins: http://roi.r-forge.r-project.org/
Some are free, others like Mosek, CPLEX and Gurobi are commercial products. Suggest running a large subset problem and checking the relative performance of the different solvers.
Also note that your toy problem is degenerate. Group(1, 3, 4) is also a solution. I added an additional constraint that can limit the number of groups selected. If your objective function coefficients are integer values the formulation may have many degenerate solutions, a simple test is to add a small random epsilon to each of the coefficients to eliminate degeneracy and see if that improves performance.
Keying in on a couple of the OP's statements:
In the original problem I have 410 possible groups, implying 410
decision variables. The number of constraints is 49422, but in all
rows there are exactly two 1 and the remaining are 0.
and
I have 32 objects.
It seems like this can be formulated as a one-sided matching problem with a utility matrix instead of a preference matrix, which can be solved using the matchingR package.
The problem is set up using a savings matrix where the row and column indices refer to objects and each cell (and its mirror across the main diagonal) represents the savings for a group composed of two objects (the row and column).
First a smaller example with 7 objects and 15 groups:
library(matchingR) # for the roommate function
library(Rfast) # for the rowSort function
set.seed(379327748)
n <- 7L # number of objects
m <- matrix(0L, n, n) # initialize the savings matrix
# specify the savings for the 15 groups
m[which(lower.tri(m))[sample(n*(n - 1L)/2L, 15)]] <- sample(1e3, 15, TRUE)
# make the savings matrix symmetric
m[upper.tri(m)] <- t(m)[upper.tri(m)]
# the savings matrix: each cell refers to the savings for the row/column pair
# it is symmetric: 1 paired with 2 has the same savings as 2 paired with 1
m
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 0 692 429 767 133 434 619
#> [2,] 692 0 0 997 146 801 0
#> [3,] 429 0 0 214 966 683 0
#> [4,] 767 997 214 0 835 0 0
#> [5,] 133 146 966 835 0 888 513
#> [6,] 434 801 683 0 888 0 0
#> [7,] 619 0 0 0 513 0 0
# use the roommate function to get the optimal pairings
groups <- unique(rowSort(cbind(1:n, roommate(utils = m))))
# remove objects with no pair in the optimal solution (NAs happen only with an
# odd number of objects)
groups <- groups[!is.na(groups[,2]),]
groups <- groups[m[groups] > 0,]
# show the solution
groups
#> [,1] [,2]
#> [1,] 1 7
#> [2,] 2 4
#> [3,] 3 5
c(savings = sum(m[groups]))
#> savings
#> 2582
Now an example with 32 objects and 410 groups. The solution is provided almost instantly.
n <- 32L
m <- matrix(0L, n, n)
m[which(lower.tri(m))[sample(n*(n - 1L)/2L, 410)]] <- sample(1e3, 410, TRUE)
m[upper.tri(m)] <- t(m)[upper.tri(m)]
system.time(groups <- unique(rowSort(cbind(1:n, roommate(utils = m)))))
#> user system elapsed
#> 0 0 0
groups <- groups[m[groups] > 0,]
groups
#> [,1] [,2]
#> [1,] 1 15
#> [2,] 2 18
#> [3,] 3 32
#> [4,] 4 19
#> [5,] 5 30
#> [6,] 6 9
#> [7,] 7 12
#> [8,] 8 14
#> [9,] 10 29
#> [10,] 11 24
#> [11,] 13 16
#> [12,] 17 20
#> [13,] 21 27
#> [14,] 22 31
#> [15,] 23 26
#> [16,] 25 28
c(savings = sum(m[groups]))
#> savings
#> 14369
# check that each object is used only once
max(tabulate(groups, 32L))
#> [1] 1
I have a question about creating vectors. If I do a <- 1:10, "a" has the values 1,2,3,4,5,6,7,8,9,10.
My question is how do you create a vector with specific intervals between its elements. For example, I would like to create a vector that has the values from 1 to 100 but only count in intervals of 5 so that I get a vector that has the values 5,10,15,20,...,95,100
I think that in Matlab we can do 1:5:100, how do we do this using R?
I could try doing 5*(1:20) but is there a shorter way? (since in this case I would need to know the whole length (100) and then divide by the size of the interval (5) to get the 20)
In R the equivalent function is seq and you can use it with the option by:
seq(from = 5, to = 100, by = 5)
# [1] 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
In addition to by you can also have other options such as length.out and along.with.
length.out: If you want to get a total of 10 numbers between 0 and 1, for example:
seq(0, 1, length.out = 10)
# gives 10 equally spaced numbers from 0 to 1
along.with: It takes the length of the vector you supply as input and provides a vector from 1:length(input).
seq(along.with=c(10,20,30))
# [1] 1 2 3
Although, instead of using the along.with option, it is recommended to use seq_along in this case. From the documentation for ?seq
seq is generic, and only the default method is described here. Note that it dispatches on the class of the first argument irrespective of argument names. This can have unintended consequences if it is called with just one argument intending this to be taken as along.with: it is much better to use seq_along in that case.
seq_along: Use this instead of seq(along.with = .)
seq_along(c(10,20,30))
# [1] 1 2 3
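The options above can be compared side by side. This short sketch (variable names are made up for illustration) also shows that combining by with length.out avoids having to know the end point in advance, which was the concern with 5*(1:20) in the question.

```r
by5    <- seq(from = 5, to = 100, by = 5)        # fixed step, known end point
n20    <- seq(from = 5, by = 5, length.out = 20) # fixed step, known count
unit10 <- seq(0, 1, length.out = 10)             # 10 equally spaced values
idx    <- seq_along(c(10, 20, 30))               # 1 2 3

all(by5 == 5 * (1:20))   # TRUE
all(n20 == by5)          # TRUE
length(unit10)           # 10
```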
Use the code
x = seq(0,100,5) #this means (starting number, ending number, interval)
the output will be
[1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75
[17] 80 85 90 95 100
Usually, we want to divide our vector into a number of intervals.
In this case, you can use a function where (a) is a vector and
(b) is the number of intervals. (Let's suppose you want 4 intervals)
a <- 1:10
b <- 4
FunctionIntervalM <- function(a,b) {
seq(from=min(a), to = max(a), by = (max(a)-min(a))/b)
}
FunctionIntervalM(a,b)
# 1.00 3.25 5.50 7.75 10.00
Therefore you have 4 intervals:
1.00 - 3.25
3.25 - 5.50
5.50 - 7.75
7.75 - 10.00
You can also use a cut function
cut(a, 4)
# (0.991,3.25] (0.991,3.25] (0.991,3.25] (3.25,5.5] (3.25,5.5] (5.5,7.75]
# (5.5,7.75] (7.75,10] (7.75,10] (7.75,10]
#Levels: (0.991,3.25] (3.25,5.5] (5.5,7.75] (7.75,10]
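As a variation on the two snippets above, the break points can also be built with length.out and passed to cut explicitly. This is only a sketch; include.lowest = TRUE is used here so that the minimum value falls into the first interval instead of relying on cut's automatic widening of the range.

```r
a <- 1:10
b <- 4

# b intervals need b + 1 break points
breaks <- seq(min(a), max(a), length.out = b + 1)
breaks
# 1.00 3.25 5.50 7.75 10.00

# Assign each element of a to one of the b intervals
bins <- cut(a, breaks = breaks, include.lowest = TRUE)
table(bins)   # counts per interval: 3 2 2 3
```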
I am attempting to write my own function for total sum of square, within sum of square, and between sum of square in R Studio for my own implementation of k-means.
I've successfully written the function for within sum of square, but I'm having difficulty with total sum of square (and thus bss). The result I get is significantly larger than what R's own kmeans function computes. I'm confused because I am following exactly what formulas provide. Here is my data:
A =
36 3
73 3
30 3
49 3
47 11
47 11
0 7
46 5
16 3
52 4
0 8
21 3
0 4
57 6
31 5
0 6
40 3
31 5
38 4
0 5
59 4
61 6
48 7
29 2
0 4
19 4
19 3
48 9
48 4
21 5
where each column is a feature. This is the function I've created thus far for tss:
tot_sumoSq <- function(data){
avg = mean( as.matrix(data) )
r = matrix(avg, nrow(data), ncol(data))
tot_sumoSq = sum( (data - r)^2 )
}
I receive the result 24342.4, but R gives 13244.8. Am I completely missing something?
The latter value is calculated using the column means. If you use this for calculating the means, you'll get the same answer.
avg = colMeans(data)
r = matrix(avg, nrow(data), ncol(data), byrow=T)
[1] 13244.8
Maybe there is something wrong in your program: you subtract a matrix from a data frame. Use the following:
tot_sumoSq <- function(data){
data = as.matrix(data)
x = sum((data - mean(data))^2)
return(x)
}
From my side it gives the correct answer.
I found a solution to my issue by combining solutions provided by the first two commentators. I see what my previous mistake was and would like to clear any confusion for future scientists.
tot_sumoSq <- function(data){
avg = colMeans(data)
r = matrix(avg, nrow(data), ncol(data), byrow = T)
data = as.matrix(data)
return( sum( (data - r)^2 ) )
}
Each column is the entire sample for different features, so when we calculate the mean for each column, it is the mean of means for the entire sample for one feature. My conceptual mistake earlier was to combine both features to calculate an overall mean.
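The difference between the two results can be reproduced directly from the data in the question. The sketch below centers each column on its own mean with sweep (one of several equivalent ways) and, for comparison, also computes the version that uses a single overall mean.

```r
# Data from the question: 30 observations, 2 features
A <- matrix(c(36, 3, 73, 3, 30, 3, 49, 3, 47, 11, 47, 11,  0, 7, 46, 5,
              16, 3, 52, 4,  0, 8, 21, 3,  0, 4, 57, 6, 31, 5,  0, 6,
              40, 3, 31, 5, 38, 4,  0, 5, 59, 4, 61, 6, 48, 7, 29, 2,
               0, 4, 19, 4, 19, 3, 48, 9, 48, 4, 21, 5),
            ncol = 2, byrow = TRUE)

# Total sum of squares: deviations from each column's own mean
tss <- sum(sweep(A, 2, colMeans(A))^2)
tss       # 13244.8

# Deviations from one overall mean (the original mistake)
overall <- sum((A - mean(A))^2)
overall   # 24342.4
```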
I am trying to revert the indexing of a matrix in R. The following example illustrates my problem:
#sample data:
set.seed(21)
m <- matrix(sample(100,size = 100),10,10)
# sorting:
t(apply(m,1,order))
# new exemplary order after sorting:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 3 7 10 6 5 9 2 4 1 8
[2,] 1 6 4 7 3 9 5 8 2 10
[3,] 2 5 8 10 4 7 9 1 3 6
[4,] 8 1 9 2 7 3 4 6 10 5
[5,] 6 9 5 2 7 3 10 4 8 1
[6,] 2 7 4 8 6 9 3 10 1 5
[7,] 1 6 4 10 3 2 7 8 9 5
[8,] 1 2 6 9 3 10 5 7 4 8
[9,] 9 4 5 7 10 2 8 3 1 6
[10,] 6 8 4 3 2 1 5 10 7 9
# we can create m2 with the above sorting. We also add 1000 to all values
m2 <- t(apply(m,1,function(x){
x[order(x)]
})) + 1000
# the next step would be to obtain the original arrangement of columns again, as described below.
After the sorting of my data we have the following situation: In row 1, the 3rd column (of matrix m2) is mapped to the original first column (of matrix m), the 7th column is mapped to the original second column, the 10th column to the original 3rd column, and so on.
My question is as follows: Can I somehow revert this mapping in R? What I mean by this is again for row 1, move the 1st column (of m2) to the position of the 3rd column (of m), then move the 2nd column to the position of the 7th, move the 3rd to the position of the 10th, and so on.
In the end, what I am trying to achieve is to sort my data but save the existing arrangement of the columns somehow, so that later, after some transformations of my data, I can rearrange them into the original ordering again. When I use the usual sorting algorithms in R, I lose the old positioning of my columns. Of course most of the time you would not need those anymore, but atm I do need them.
Background
I think it will help to examine the effect of the order() and rank() functions on a simple vector. Consider:
x <- c('c','b','d','b','a');
seq_along(x);
## [1] 1 2 3 4 5
order(x);
## [1] 5 2 4 1 3
rank(x); ## default is ties.method='average'
## [1] 4.0 2.5 5.0 2.5 1.0
rank(x,ties.method='first');
## [1] 4 2 5 3 1
rank(x,ties.method='last'); ## available from 3.3.0
## [1] 4 3 5 2 1
rank(x,ties.method='random'); ## we can ignore this one, obviously
## [1] 4 2 5 3 1
rank(x,ties.method='max');
## [1] 4 3 5 3 1
rank(x,ties.method='min');
## [1] 4 2 5 2 1
(I used character values to demonstrate that these principles and algorithms can apply to any (comparable) data type, not just numeric types. But obviously this includes numeric types.)
The order() function returns a vector that is the same length as the input vector. The order values represent a reordering of the input indexes (which are shown above courtesy of seq_along()) in such a way that when the input vector is indexed with the order vector, it will be sorted (according to the chosen sort method, which (if not explicitly overridden by a method argument), is radixsort for integer, logical, and factor, shellsort otherwise, and takes into account the collation order of the current locale for character values when not using radixsort). In other words, for an element of the result vector, its value gives the input index of the element in the input vector that should be moved to that position in order to sort it.
To try to put it even more plainly, an element of the order vector basically says "place the input vector element with this index in my position". Or, in a slightly more generic way (which will dovetail with the parallel description of rank()):
order element: the input vector element with this index sorts into my position.
In a sense, rank() does the inverse of what order() does. Its elements correspond to the elements of the input vector by index, and its values give a representation of the sort order of the corresponding input element (with tiebreaking behavior depending on the ties.method argument; this contrasts with order(), which always preserves the incoming order of ties, equivalent to ties.method='first' for rank()).
To use the same language structure that I just used for order(), which is the plainest manner of expression I can think of:
rank element: the input vector element in my position sorts into this index.
Of course, this description is only perfectly accurate for ties.method='first'. For the others, the destination index for ties will actually be the reverse of the incoming order (for 'last'), the lowest index of the duplicate set (for 'min'), the highest (for 'max'), the average (for 'average', which is actually the default), or random (for 'random'). But for our purposes, since we need to mirror the proper sort order as per order() (and therefore sort(), which uses order() internally), let's ignore the other cases from this point forward.
I've thought of one final way to articulate the behaviors of the order() and rank() functions: order() defines how to pull elements of the input vector into a sorted order, while rank() defines how to push elements of the input vector into a sorted order.
This is why indexing the input vector with the results of order() is the correct way to sort it. Indexing a vector is inherently a pulling operation. Each respective index vector element effectively pulls the input vector element that is stored at the index given by that index vector element into the position occupied by that index vector element in the index vector.
Of course, the "push vector" produced by rank() cannot be used in the same way as the "pull vector" produced by order() to directly sort the input vector, since indexing is a pull operation. But we can ask, is it in any way possible to use the push vector to sort the input vector? Yes, I've thought of how this can be done. The solution is index-assigning, which is inherently a push operation. Specifically, we can index the input vector with the push vector as the (lvalue) LHS and assign the input vector itself as the RHS.
So, here are the three methods you can use to sort a vector:
x[order(x)];
[1] "a" "b" "b" "c" "d"
sort(x); ## uses order() internally
[1] "a" "b" "b" "c" "d"
y <- x; y[rank(y,ties.method='first')] <- y; y; ## (copied to protect x, but not necessary)
[1] "a" "b" "b" "c" "d"
An interesting property of the rank() function with ties.method='first' is that it is idempotent. This is because, once you've produced a rank vector, ranking it again will not change the result. Think about it: say the first element ranks 4th. Then the first call will produce a 4 in that position. Running rank() again will again find that it ranks 4th. You don't even need to specify ties.method anymore for the subsequent calls to rank, because the values will have become distinct on the first call's (potential) tiebreaking.
rank(x,ties.method='first');
## [1] 4 2 5 3 1
rank(rank(x,ties.method='first'));
## [1] 4 2 5 3 1
rank(rank(rank(x,ties.method='first')));
## [1] 4 2 5 3 1
y <- rank(x,ties.method='first'); for (i in seq_len(1e3L)) y <- rank(y); y;
## [1] 4 2 5 3 1
On the other hand, order() is not idempotent. Repeatedly calling order() has the interesting effect of alternating between the push and pull vectors.
order(x);
## [1] 5 2 4 1 3
order(order(x));
## [1] 4 2 5 3 1
order(order(order(x)));
## [1] 5 2 4 1 3
Think about it: if the last element sorts 1st, then the first call to order() will pull it into the 1st position by placing its index (which is largest of all indexes) into the 1st position. The second call to order() will identify that the element in the 1st position is largest in the entire vector, and thus will pull index 1 into the last position, which is equivalent to ranking the last element with its rank of 1.
Solutions
Based on all of the above, we can devise 3 solutions to your problem of "desorting", if you will.
For input, let's assume that we have (1) the input vector x, (2) its sort order o, and (3) the sorted and possibly transformed vector xs. For output we need to produce the same vector xs but desorted according to o.
Common input:
x <- c('c','b','d','b','a'); ## input vector
o <- order(x); ## order vector
xs <- x[o]; ## sorted vector
xs <- paste0(xs,seq_along(xs)); ## somewhat arbitrary transformation
x;
## [1] "c" "b" "d" "b" "a"
o;
## [1] 5 2 4 1 3
xs;
## [1] "a1" "b2" "b3" "c4" "d5"
Method 1: pull rank()
Since the order and rank vectors are effectively inverses of each other (i.e. pull and push vectors), one solution is to compute the rank vector in addition to the order vector o, and use it to desort xs.
xs[rank(x,ties.method='first')];
## [1] "c4" "b2" "d5" "b3" "a1"
Method 2: pull repeated order()
Alternatively, instead of computing rank(), we can simply use a repeated order() call on o to generate the same push vector, and use it as above.
xs[order(o)];
## [1] "c4" "b2" "d5" "b3" "a1"
Method 3: push order()
I was thinking to myself that, since we already have the order vector o, we really shouldn't have to go to the trouble of computing another order or rank vector. Eventually I realized that the best solution is to use the pull vector o as a push vector. This accomplishes the desorting objective with the least work.
xs[o] <- xs;
xs;
## [1] "c4" "b2" "d5" "b3" "a1"
Benchmarking
library(microbenchmark);
desort.rank <- function(x,o,xs) xs[rank(x,ties.method='first')];
desort.2order <- function(x,o,xs) xs[order(o)];
desort.assign <- function(x,o,xs) { xs[o] <- xs; xs; };
## simple test case
x <- c('c','b','d','b','a');
o <- order(x);
xs <- x[o];
xs <- paste0(xs,seq_along(xs));
ex <- desort.rank(x,o,xs);
identical(ex,desort.2order(x,o,xs));
## [1] TRUE
identical(ex,desort.assign(x,o,xs));
## [1] TRUE
microbenchmark(desort.rank(x,o,xs),desort.2order(x,o,xs),desort.assign(x,o,xs));
## Unit: microseconds
## expr min lq mean median uq max neval
## desort.rank(x, o, xs) 106.487 122.523 132.15393 129.366 139.843 253.171 100
## desort.2order(x, o, xs) 9.837 12.403 15.66990 13.686 16.251 76.122 100
## desort.assign(x, o, xs) 1.711 2.567 3.99916 3.421 4.277 17.535 100
## scale test case
set.seed(1L);
NN <- 1e4; NE <- 1e5; x <- sample(seq_len(NN),NE,T);
o <- order(x);
xs <- x[o];
xs <- xs+seq(0L,NE-1L)/NE;
ex <- desort.rank(x,o,xs);
identical(ex,desort.2order(x,o,xs));
## [1] TRUE
identical(ex,desort.assign(x,o,xs));
## [1] TRUE
microbenchmark(desort.rank(x,o,xs),desort.2order(x,o,xs),desort.assign(x,o,xs));
## Unit: milliseconds
## expr min lq mean median uq max neval
## desort.rank(x, o, xs) 36.488185 37.486967 39.89157 38.613191 39.145405 85.849143 100
## desort.2order(x, o, xs) 16.764414 17.262630 18.10341 17.443527 19.014296 28.338835 100
## desort.assign(x, o, xs) 1.457014 1.498495 1.82893 1.527363 1.592151 4.255573 100
So, clearly the index-assignment solution is the best.
Demo
Below is a demonstration of how this solution can be used for your sample input.
I honestly think that a simple for-loop over the rows is preferable to an apply() call in this case, since you can modify the matrix in-place. If you need to preserve the sorted intermediate matrix, you can copy it before applying this desorting operation.
## generate input matrix
set.seed(21L); m <- matrix(sample(seq_len(100L)),10L); m;
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 79 61 1 66 40 39 2 86 44 26
## [2,] 25 84 49 35 67 32 36 70 50 100
## [3,] 69 6 90 51 30 92 65 34 68 42
## [4,] 18 54 72 73 85 75 55 15 27 77
## [5,] 93 16 23 58 9 7 19 64 8 46
## [6,] 88 4 60 13 98 47 5 29 56 80
## [7,] 10 45 43 14 95 11 74 76 83 38
## [8,] 17 24 57 82 63 28 71 87 53 59
## [9,] 91 41 81 21 22 94 33 62 12 37
## [10,] 78 52 48 31 89 3 97 20 99 96
## sort each row, capturing sort order in rowwise order matrix
o <- matrix(NA_integer_,nrow(m),ncol(m)); ## preallocate
for (ri in seq_len(nrow(m))) m[ri,] <- m[ri,o[ri,] <- order(m[ri,],decreasing=T)];
## whole-matrix transformation
## embed row index (minus 1) as tenths digit, sorted column index (minus 1) as hundredths digit (arbitrary)
m <- m+(row(m)-1L)/nrow(m)+(col(m)-1L)/ncol(m)/10;
## desort
for (ri in seq_len(nrow(m))) m[ri,o[ri,]] <- m[ri,]; m;
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 79.01 61.03 1.09 66.02 40.05 39.06 2.08 86.00 44.04 26.07
## [2,] 25.19 84.11 49.15 35.17 67.13 32.18 36.16 70.12 50.14 100.10
## [3,] 69.22 6.29 90.21 51.25 30.28 92.20 65.24 34.27 68.23 42.26
## [4,] 18.38 54.36 72.34 73.33 85.30 75.32 55.35 15.39 27.37 77.31
## [5,] 93.40 16.46 23.44 58.42 9.47 7.49 19.45 64.41 8.48 46.43
## [6,] 88.51 4.59 60.53 13.57 98.50 47.55 5.58 29.56 56.54 80.52
## [7,] 10.69 45.64 43.65 14.67 95.60 11.68 74.63 76.62 83.61 38.66
## [8,] 17.79 24.78 57.75 82.71 63.73 28.77 71.72 87.70 53.76 59.74
## [9,] 91.81 41.84 81.82 21.88 22.87 94.80 33.86 62.83 12.89 37.85
## [10,] 78.94 52.95 48.96 31.97 89.93 3.99 97.91 20.98 99.90 96.92
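If the loop over rows ever becomes the bottleneck, the same index-assignment idea can be written as a single matrix-indexed assignment. The following is only a sketch and was not part of the benchmark above; `desort.rows` is a hypothetical helper name, and it assumes `o` holds one rowwise order vector per row, as in the demo.

```r
## hypothetical helper: undo a rowwise sort, given the rowwise order matrix o
desort.rows <- function(ms, o) {
    res <- ms
    ## element ms[i,j] goes back to column o[i,j] of row i
    res[cbind(c(row(o)), c(o))] <- c(ms)
    res
}

set.seed(21L)
m0 <- matrix(sample(seq_len(100L)), 10L)
## rowwise order matrix: o[i,] is the decreasing order of row i
o <- t(apply(m0, 1L, order, decreasing = TRUE))
## rowwise sorted copy: ms[i,j] is m0[i,o[i,j]]
ms <- matrix(m0[cbind(c(row(o)), c(o))], nrow(m0))
identical(desort.rows(ms, o), m0)
## [1] TRUE
```

Both `c(ms)` and the two-column index matrix are laid out in column-major order, so each value is paired with its own (row, original column) target.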
rank() is the complement to order(). You need to save the original rank() and you can use that to get back to the original ordering after rearranging with order().
I think your example is overcomplicated (far from minimal!) by putting things in a matrix and doing extra stuff. Because you are applying functions at the row-level you just need to solve it for a vector. An example:
set.seed(47)
x = rnorm(10)
xo = order(x)
xr = rank(x)
x[xo][xr] == x
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
In your case, you can perform whatever transformations you want on the ordered vector x[xo], then index the result by [xr] to get back to the original ordering.
sorted_result = x[xo] + c(1, diff(x[xo])) # some order-dependent transformation
final_result = sorted_result[xr] # back to original ordering
If there's a possibility of ties, you'll want to use ties.method = 'first' in the rank() call.
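To see why the ties setting matters, here is a small sketch with a made-up vector containing a tie. With the default ties.method = "average", tied elements get a fractional rank, and R truncates fractional indices, so one sorted element is picked twice and another is lost.

```r
x <- c(3, 1, 3, 2)                          # made-up input with a tie
xo <- order(x)
sorted_result <- x[xo] + seq_along(x) / 10  # position-dependent transformation

xr_avg   <- rank(x)                         # default: ties.method = "average"
xr_first <- rank(x, ties.method = "first")

xr_avg                    # 3.5 1.0 3.5 2.0, not usable as positions
xr_first                  # 3 1 4 2
sorted_result[xr_avg]     # 3.5 is truncated to 3, so element 3 appears twice
sorted_result[xr_first]   # every sorted element is used exactly once
```

With ties.method = "first" the ranks form a proper permutation, which is what the unsorting step needs.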
Taking this back to the matrix example:
m3 = t(apply(m, 1, function(x) {
  xo = order(x)
  xr = rank(x, ties.method = 'first')
  (x[xo] + 1000)[xr] # add 1000 to sorted matrix and then "unsort"
}))
# check that it worked
all(m3 == (m + 1000))
# [1] TRUE
I have a question about optimizing program use across multiple rooms in a building, subject to a couple of constraints. How can I solve for the optimum program in all 4 rooms at the same time, based on the data and constraints?
The following is a small sample of my data.
Program                    sf  101  201  301  401
(1) Offices               120    6    6    5    5
(2) Gallery              1000    5    5    4    4
(3) Reception / Greeter   300    6    6    5    7
(4) Studio / Classroom    800    6    6    5    5
101, 201, 301, and 401 represent 4 rooms in a building.
Program is the potential room use.
sf represents the maximum square feet of the intended program (one constraint).
The data represents how many criteria the intended program matches with attributes of each room (this is the number I want to maximize).
I would also like to place constraints that would allow me to say I only want a certain number of Offices, Galleries, etc. in the building.
e.g., I want 1 Reception/Greeter and 2 Offices, and the last room can be filled by the best available match. Room 101 has a maximum sf of 150, Room 201 has a maximum of 250, Room 301 has a maximum of 1500, and Room 401 has a maximum of 500 (these constraints are not in my data frame because I couldn't think of a good way to include them).
This example should return 101 = Offices, 201 = Offices, 301 = one of the three excluding Gallery, and 401 = Reception / Greeter.
Update:
Objective should be something like this (I want them all maximized):
obj <- data$101, data$201, data$301, data$401 (and probably data$sf too)
Then I'm not really sure how to write the constraints but they would be this:
data$101
data$sf <= 150
number of solutions from this column should be 1
data$201
data$sf <= 250
number of solutions from this column should be 1
...
data$401
data$sf <= 500
number of solutions from this column should be 1
And then finally somehow restrict the number of "Offices", "Gallery", "Reception / Greeter", "Studio / Classroom".
Maybe something like:
as.numeric(data$Program %in% c("(1) Offices")) == 1
Hopefully this clarifies some things.
It sounds like you are assigning programs (rows in data) to rooms (columns in data) such that you maximize the value of the assignment. You can only assign a program to a room if the room is big enough and you can only assign one program to a room, though the same program can be assigned to multiple rooms (e.g. you could assign "Office" to both room 101 and 201). Therefore, your data really consists of program sizes, room sizes, and objective values:
program.size <- c(120, 1000, 300, 800)
room.size <- c(150, 250, 1500, 500)
(obj.vals <- matrix(c(6, 5, 6, 6, 6, 5, 6, 6, 5, 4, 5, 5, 5, 4, 7, 5), nrow=4))
# [,1] [,2] [,3] [,4]
# [1,] 6 6 5 5
# [2,] 5 5 4 4
# [3,] 6 6 5 7
# [4,] 6 6 5 5
A simple way to block assigning a program to a room that is too small would be to set the objective value for such an assignment to a low value (I'll use 0 here):
(obj.adj <- obj.vals * outer(program.size, room.size, "<="))
# [,1] [,2] [,3] [,4]
# [1,] 6 6 5 5
# [2,] 0 0 4 0
# [3,] 0 0 5 7
# [4,] 0 0 5 0
Now, you can approach this problem using integer programming, defining a variable x_pr that takes value 1 if program p is assigned to room r and 0 otherwise. You can code up the objective and constraints pretty easily using the lpSolve package in R:
# Convenience variables
nr <- nrow(obj.adj)
nc <- ncol(obj.adj)
# Model
library(lpSolve)
mod <- lp("max",
          as.vector(obj.adj),
          t(1*sapply(1:nc, function(x) rep(1:nc == x, each=nr))),
          rep("<=", nc),
          rep(1, nc),
          all.bin=TRUE)
matrix(mod$solution, nrow=nr)
# [,1] [,2] [,3] [,4]
# [1,] 1 1 0 0
# [2,] 0 0 0 0
# [3,] 0 0 0 1
# [4,] 0 0 1 0
Now we've assigned "Office" to rooms 101 and 201, "Studio/Classroom" to room 301, and "Reception/Greeter" to room 401.
It's worth noting that this problem could easily be solved by selecting the program with the largest value for each room in obj.adj, so the use of lpSolve is only warranted if you have more complicated constraints than the ones mentioned in the question.
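As one example of such a constraint, the program counts from the question (2 Offices, 1 Reception / Greeter) could be added as extra rows of the constraint matrix. This is only a sketch under my own assumptions: the counts are treated as exact, misfit assignments are forbidden outright rather than just scored 0, and the names room.con, program.con, nofit.con and mod2 are made up for illustration.

```r
library(lpSolve)

program.size <- c(120, 1000, 300, 800)  # Offices, Gallery, Reception, Studio
room.size    <- c(150, 250, 1500, 500)  # rooms 101, 201, 301, 401
obj.vals <- matrix(c(6, 5, 6, 6, 6, 5, 6, 6, 5, 4, 5, 5, 5, 4, 7, 5), nrow=4)
fits     <- outer(program.size, room.size, "<=")
obj.adj  <- obj.vals * fits
nr <- nrow(obj.adj)
nc <- ncol(obj.adj)

# Variables are ordered column by column: x[p, r] is at position (r - 1) * nr + p
room.con    <- t(1*sapply(1:nc, function(r) rep(1:nc == r, each=nr)))   # one row per room
program.con <- t(1*sapply(1:nr, function(p) rep(1:nr == p, times=nc)))  # one row per program
nofit.con   <- matrix(1 * !as.vector(fits), nrow=1)                     # program too big for room

mod2 <- lp("max",
           as.vector(obj.adj),
           rbind(room.con, program.con[c(1, 3), ], nofit.con),
           c(rep("<=", nc), "=", "=", "="),
           c(rep(1, nc), 2, 1, 0),  # <= 1 program per room, 2 Offices, 1 Reception, 0 misfits
           all.bin=TRUE)
mod2$status                       # 0 means an optimal solution was found
matrix(mod2$solution, nrow=nr)    # rows are programs, columns are rooms
```

Under these assumptions the solver should put Offices in 101 and 201, Studio / Classroom in 301, and Reception / Greeter in 401, which is consistent with the result the question expects.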