I am trying to formulate a shelf-optimization integer program with lpSolveAPI and wish to add a constraint whereby the same number of each product is placed on each selected shelf (S).
My difficulty is accessing and using these sums as constraints (specifically the X_ij).
I can reformulate to make it linear without too many problems (please excuse the pseudocode):
sum(X_ij)*F_ij - sum(F_ij) = 0
This constraint could affect the choice of X itself, so it needs to be part of the model rather than applied post-solve (otherwise I could just change the values after solving). How can I access these sums within the model, or force the F values to be equal?
The linear program first creates a binary solution through a series of constraints that place products on shelves (a product may currently be placed on one or two shelves; there are four shelves). A second set of variables then allows a non-zero number of facings of each product on those shelves, constrained by the width of the products against the length of the shelves, with a given maximum number of facings across all shelves (8 for most products, though this is somewhat arbitrary), while maximising product profit. All of this works as expected. However, I wish to add a constraint such that the number of facings on two or more shelves is the same, i.e. four on one shelf and four on the other. Since the number of shelves used can be 1 or 2, I cannot simply divide the values. Further, because which shelves are occupied is decided by the constraints, I cannot simply impose P1S1 = P1S2 (unless I could select the occupied shelves, which I am failing to do).
Here is a minimal example of what I am trying to do (please excuse the inelegant code, as I am doing this for the first time); the dataset is here:
library(lpSolveAPI)
shelves <- data.frame(Sl_i = c(151, 200, 180, 218),
Sh_i = c(30, 30, 30, 36))
mydata <- read.csv("~/Desktop/sales/datatable.txt", sep = "") # product data, referred to as mydata below
S = 4 # number of shelves
P = 40 # number of products
Shelf_choice <-
make.lp(0, nrow(mydata) * 2) # create the lp object with twice as many decision variables as rows in the data frame (one binary and one integer variable per product-shelf pair)
#### SET OBJECTIVE FUNCTION ####
#### Set controls for the model ####
lp.control(Shelf_choice,
sense = "max",
timeout = 10,
presolve = "none") ## timeout prevents getting stuck
set.objfn(Shelf_choice,
c(rep(0, nrow(mydata)), mydata$Pu_j)) # maximize profit (Pu_j); the binary placement variables get zero weight
set.type(Shelf_choice, 1:nrow(mydata), "binary") # present on shelf or not
set.type(Shelf_choice, 1:nrow(mydata) + nrow(mydata), "integer") # number of product j on shelf
### Assure that each product appears on minimum number of shelves (1 in this case)
Add_productShelf_constraint <- function (prod_index) {
cargo_cols <-
(0:(S - 1)) * P + prod_index # index of products by column (e.g. 1, 41, 81, 121)
add.constraint(
Shelf_choice,
rep(1, S),
# repeat value the same number of times as shelves
indices = cargo_cols,
type = ">=",
rhs = mydata$smin_j[prod_index]
) # value of minimum number of shelves
}
lapply(1:P, Add_productShelf_constraint) # list apply this for every product
### Assure that product appears no more than the number of shelves permitted (2 in this case)
Add_productShelfMAX_constraint <- function (prod_index) {
cargo_cols <-
(0:(S - 1)) * P + prod_index # index of products by column (eg. 1,41,81,121)
add.constraint(
Shelf_choice,
rep(1, S),
# repeat value the same number of times as shelves
indices = cargo_cols,
type = "<=",
rhs = mydata$smax_j[prod_index]
) # value of maximum number of shelves
}
lapply(1:P, Add_productShelfMAX_constraint) # list apply this for every product
### Third Constraint: Products too tall for a shelf are excluded
Add_height_constraint <-
function (prod_index) {
# this needs to be improved
add.constraint(Shelf_choice,
1,
indices = prod_index,
type = "=",
rhs = 0)
}
lapply(which(mydata$height == 0), Add_height_constraint) # Here we select the columns which have 0 (don't fit), and set the value to 0
## Products are on consecutive shelves - this currently only works for two shelves
Add_nextshelf_constraint <- function (prod_index) {
mat1 <- combn(1:S, 2)[, which(combn(1:S, 2)[2, ] - combn(1:S, 2)[1, ] != 1)]
cargo_cols <- (0:(S - 1)) * P + prod_index
result <- matrix(cargo_cols[mat1], nrow = 2)
for (i in 1:ncol(result)) {
add.constraint(
Shelf_choice,
c(1, 1),
indices = result[, i],
type = "<=",
rhs = 1
)
}
}
lapply(1:P, Add_nextshelf_constraint)
### Product facings only appear on selected shelves (where Xij = 1)
Add_FF_constraint1 <- function (prod_index) {
Y01col <- prod_index
print(Y01col)
FF_col <- prod_index + nrow(mydata)
add.constraint(
Shelf_choice,
c(1, -100),
indices = c(FF_col, Y01col),
type = "<=",
rhs = 0
)
}
lapply(1:nrow(mydata), Add_FF_constraint1) #
Add_FF_constraint2 <- function (prod_index) {
Y01col <- prod_index
FF_col <- prod_index + nrow(mydata)
add.constraint(
Shelf_choice,
c(1, -1),
indices = c(FF_col, Y01col),
type = ">=",
rhs = 0
)
}
lapply(1:nrow(mydata), Add_FF_constraint2) #
#### Sum of product widths on shelves does not exceed shelf length
Add_FijShelflength_constraint <- function (shelf_index) {
shelf_cols_mydata <- ((1:(P)) + (shelf_index - 1) * P)
FF_shelf_cols <- ((1:(P)) + (shelf_index - 1) * P) + nrow(mydata)
add.constraint(
Shelf_choice,
c(mydata$Pw_j[shelf_cols_mydata]),
# width of each product
indices = c(FF_shelf_cols),
# indices of products per shelf in the Fij matrix
type = "<=",
# "<=" is the default, stated explicitly: total width must not exceed shelf length
rhs = shelves$Sl_i[shelf_index]
) # length of each shelf
}
lapply(1:S, Add_FijShelflength_constraint) # list apply this by shelf index
## add minimum number of total facings
Add_min_facings_constraint <- function (prod_index) {
FjSi_cols <-
(0:(S - 1)) * P + prod_index + nrow(mydata) # index of the products by column in out table
add.constraint(
Shelf_choice,
rep(1, S),
# repeat value the same number of times as shelves
indices = FjSi_cols,
# index of products by column (eg. 1,41,81,121)
type = ">=",
rhs = mydata$Fmin_j[prod_index]
) # value of minimum number of products
}
lapply(1:P, Add_min_facings_constraint) # list apply this for every product
## add maximum number of facings
Add_max_facings_constraint <- function (prod_index) {
FjSi_cols <-
(0:(S - 1)) * P + prod_index + nrow(mydata)
add.constraint(
Shelf_choice,
rep(1, S),
# repeat value the same number of times as shelves
indices = FjSi_cols,
# index of products by column (eg. 1,41,81,121)
type = "<=",
rhs = mydata$Fmax_j[prod_index]
) # value of maximum number of products
}
lapply(1:P, Add_max_facings_constraint) # list apply this for every product
solve(Shelf_choice)
get.objective(Shelf_choice) # gives the total value of the facings
### Tabulates the results ####
test <- matrix(get.variables(Shelf_choice),
ncol = S * 2,
byrow = F)
rownames(test) <- paste0("Product", 1:40)
colnames(test) <- c(rep(paste0("Shelf", 1:4), 2))
test[, 5:8] # shows the product placements (uneven products between shelves)
Results:

Product   Shelf1   Shelf2   Shelf3   Shelf4
P1        0        0        0        2
...       ...      ...      ...      ...
P11       1        7        0        0
P16       2        2        0        0
I need, for example, product 11 to have the same number of facings on each shelf it occupies (4 on each).
I have tried to create a constraint such as:
Sum_Xshelf_constraint <- function (prod_index) {
binary_sum <-
sum(get.variables(Shelf_choice)[(0:(S - 1)) * P + prod_index]) # number of shelves chosen for this product
total_Fij <-
sum(get.variables(Shelf_choice)[(0:(S - 1)) * P + prod_index + nrow(mydata)]) # total facings of this product
total_cols <-
(0:(S - 1)) * P + prod_index + nrow(mydata) # index of the products in the Fij block
for (i in 1:length(total_cols)) {
add.constraint(
Shelf_choice,
c(binary_sum),
indices = c(total_cols[i]),
type = "<=",
rhs = total_Fij
)
}
}
lapply(1:P, Sum_Xshelf_constraint)
This unsurprisingly will not work before solving, and once solved it has no effect.
Any ideas how to achieve this? Thank you in advance.
If your decision variables are PiSj (the number of the ith product on the jth shelf), then a series of chained equations could work:
P1S1 = P1S2
P1S2 = P1S3
P1S3 = P1S4
That kind of simplicity can sometimes markedly improve performance.
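For example, with lpSolveAPI and the column layout from the question (the facing variables F_ij are offset by nrow(mydata)), those chained equalities could be added with something like the sketch below. Note, as the question already points out, that this also ties the facings on shelves the product does not occupy, where X_ij forces F_ij to 0:
Add_equal_facings_constraint <- function (prod_index) {
FF_cols <- (0:(S - 1)) * P + prod_index + nrow(mydata) # facing columns of this product
for (s in 1:(S - 1)) {
# facings on shelf s must equal facings on shelf s + 1
add.constraint(
Shelf_choice,
c(1, -1),
indices = c(FF_cols[s], FF_cols[s + 1]),
type = "=",
rhs = 0
)
}
}
lapply(1:P, Add_equal_facings_constraint)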
I have to answer my own problem here:
It is not possible.
Reading the lp_solve documentation carefully, it seems that this particular constraint is quadratic (http://lpsolve.sourceforge.net/5.5/):
"Suppose that xj must take an integer value i:
yj / y0 = i
or
yj = i * y0
Unfortunately, this constraint can't be handled by lpsolve since it is quadratic."
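In this shelf model, my reading of the quote is that the analogous requirement would be (in the same pseudocode style as above) F_ij = k_j * X_ij for every shelf i, where k_j, the common number of facings of product j, is itself a decision variable; the constraint therefore contains a product of two decision variables.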
I will edit this answer with the quadratic solution when it is complete. Thought I had better post this now so that others don't put too much effort into trying to solve it.
Thanks!
Recently, I learned how to write a loop that initializes some number, and then randomly generates numbers until the initial number is guessed (while recording the number of guesses it took) such that no number will be guessed twice:
# https://stackoverflow.com/questions/73216517/making-sure-a-number-isnt-guessed-twice
all_games <- vector("list", 100)
for (i in 1:100){
guess_i = 0
correct_i = sample(1:100, 1)
guess_sets <- 1:100 ## initialize a set
trial_index <- 1
while(guess_i != correct_i){
guess_i = sample(guess_sets, 1) ## sample from this set
guess_sets <- setdiff(guess_sets, guess_i) ## remove it from the set
trial_index <- trial_index + 1
}
## no need to store `i` and `guess_i` (as same as `correct_i`), right?
game_results_i <- data.frame(i, trial_index, guess_i, correct_i)
all_games[[i]] <- game_results_i
}
all_games <- do.call("rbind", all_games)
I am now trying to modify the above code to create the following two loops:
(Deterministic) Loop 1 always guesses the midpoint (rounding up) and is told whether the guess is smaller or bigger than the correct number. It then takes the midpoint again (between the guess and the current floor/ceiling) until it reaches the correct number.
(Semi-Deterministic) Loop 2 first makes a random guess and is told whether the guess is bigger or smaller than the number. It then halves the remaining range and makes its next guess randomly within that smaller range, repeating this process until it reaches the correct number.
I tried to write a sketch of the code:
#Loop 2:
correct = sample(1:100, 1)
guess_1 = sample(1:100, 1)
guess_2 = ifelse(guess_1 > correct, sample(50:guess_1, 1), sample(guess_1:100, 1))
guess_3 = ifelse(guess_2 > correct, sample(50:guess_2, 1), sample(guess_2:100, 1))
guess_4 = ifelse(guess_3 > correct, sample(50:guess_3, 1), sample(guess_3:100, 1))
#etc
But I am not sure if I am doing this correctly.
Can someone please help me with this?
Thank you!
Example: Suppose I pick the number 68
Loop 1: first guess = 51 (the midpoint), (100-51)/2 + 51 = 75, (75-50)/2 + 50 = 63, (75 - 63)/2 + 63 = 69, (69 - 63)/2 + 63 = 66, etc.
Loop 2: first random guess = 53, rand_between(53,100) = 71, rand_between(51,71) = 65, rand(65,71) = 70, etc.
I don't think you need a for loop for this; you can create the structures from the beginning with sample(), sapply() and which():
## correct values can repeat, so we set replace to TRUE
corrects <- sample(1:100, 100, replace = TRUE)
## replace is by default FALSE in sample(), if you don't want repeated guesses
## sapply() creates a matrix
guesses <- sapply(1:100, function(x) sample(1:100, 100))
## constructing game_results_i equal to yours, but could be simplified
game_results_i <- data.frame(
i = 1:100,
trial_index = sapply(
1:100,
function(x) which(
## which() returns the index of the first element that makes the predicate true
guesses[, x] == corrects[x]
)
),
guess_i = corrects,
correct_i = corrects # guess_i and correct_i are obviously equal
)
Ok, let's see if now I match question and answer properly :)
If I understood your intentions correctly, in both loops you are setting increasingly finer lower and upper bounds; each guess reduces the search space. However, this interpretation does not always match your description, so please double-check that it is acceptable for your purposes.
I wrote two functions, guess_bisect for the deterministic loop_1 and guess_sample for loop_2:
guess_bisect <- function(correct, n = 100) {
lb <- 0
ub <- n + 1
trial_index <- 1
guess <- round((ub - lb) / 2) + lb
while (guess != correct) {
# cat(lb, ub, guess, "\n") # uncomment to print the guess iteration
if (guess < correct)
lb <- guess
else
ub <- guess
guess <- round((ub - lb) / 2) + lb
trial_index <- trial_index + 1
}
trial_index
}
guess_sample <- function(correct, n = 100) {
lb <- 0
ub <- n + 1
trial_index <- 1
guess <- sample((lb + 1):(ub - 1), 1)
while (guess != correct) {
# cat(lb, ub, guess, "\n") # uncomment to print the guess iteration
if (guess < correct)
lb <- guess
else
ub <- guess
guess <- sample((lb + 1):(ub - 1), 1)
trial_index <- trial_index + 1
}
trial_index
}
Obviously, guess_bisect always produces the same result for the same input, while guess_sample varies randomly.
By plotting the results in a simple chart, it seems that the deterministic bisection is on average much better, as the random sampling may happen to pick its improvements from the wrong side. The x-axis is the correct number, spanning 1 to 100, and the y-axis is the trial index; guess_bisect gives the red curve and repeated runs of guess_sample give the blue curves.
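The chart itself is not reproduced here; a rough sketch of how it could be generated (the plotting code below is my assumption, not part of the original answer):
correct_vals <- 1:100
bisect_trials <- sapply(correct_vals, guess_bisect)
sample_trials <- replicate(20, sapply(correct_vals, guess_sample)) # 20 random runs per correct number
plot(correct_vals, bisect_trials, type = "l", col = "red",
     ylim = range(c(bisect_trials, sample_trials)),
     xlab = "correct number", ylab = "trial index")
for (k in seq_len(ncol(sample_trials)))
  lines(correct_vals, sample_trials[, k], col = adjustcolor("blue", alpha.f = 0.2))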
What I am looking for is basically an R version of the answer to this question: Generating all permutation of numbers that sums up to N. First of all, that answer uses Java, which I have a really hard time reading. Second, the code uses a "deque", which I can't figure out how to implement in R.
I have found several algorithms to do this, but they have all been written in programming languages using structures not available in R such as deques, heaps or list-comprehensions.
What I actually need is a way of finding all the vectors v of length N-1 where:
sum(v * 1:(N-1)) == N
and I think I can manage that myself if only I find a way of obtaining all the ordered integer partitions.
As an example for N = 4 all the ordered integer partitions using numbers 1 to N-1 are:
1+1+1+1
1+1+2
1+3
2+2
What I effectively need is output of the either form:
c(1,1,1,1)
c(1,1,2)
c(1,3)
c(2,2)
Or of the form:
c(4,0,0)
c(2,1,0)
c(1,0,1)
c(0,2,0)
since I should be able to convert the former format to the latter by myself. Any hint as to how to approach this problem using R would be greatly appreciated. The latter format is exactly the vectors v such that sum(v * 1:3) is 4.
EDIT:
My own attempt:
rek = function(mat, id1, id2){
if(id1 + id2 != length(mat) + 1){ #If next state not absorbing
mat[id1] = mat[id1] - 1
mat[id2] = mat[id2] - 1
mat[id1+id2] = mat[id1+id2] + 1
out = mat
id = which(mat > 0)
for(i in id){
for(j in id[id>=i]){
if(j == i & mat[i] == 1){
next
}
out = rbind(out, rek(mat,i,j))
}
}
return(out)
}
}
start = c(n, rep(0, n-2))
states = rbind(start, rek(start, 1, 1))
states = states[!duplicated(states), ] #only unique states.
This is incredibly inefficient. E.g., when n = 11, my states matrix has over 120,000 rows prior to removing duplicates, which leaves only 55 rows.
EDIT 2:
Using the parts() function described below I came up with:
temp = partitions::parts(n)
temp = t(temp)
for(i in 1:length(temp[,1])){
row = temp[i,]
if(any(row>(n-1))){#if absorbing state
next
}
counts = plyr::count(row[row>0])
newrow = rep(0,n-1)
id = counts$x
numbs = counts$freq
newrow[id] = numbs
states = rbind(states, newrow)
}
states = states[-1,]#removing the first row, added manually
which exactly gives me the vectors v such that sum(v * 1:(N-1)) is N.
If anyone is interested, this is to be used within coalescent theory, as a way to describe the possible relations between N individuals omitting when all are related. As an example with N = 4:
(4, 0, 0) -- No individuals are related
(2, 1, 0) -- Two individuals are related, the rest are not
(0, 2, 0) -- The individuals are pair-wise related
(1, 0, 1) -- Three individuals are related, the other individual is not.
Hope parts() from the partitions package could help:
library(partitions)
N <- 4
res <- unique(lapply(asplit(parts(N),2),function(x) sort(x[x>0])))[-1]
which gives
> res
[[1]]
[1] 1 3
[[2]]
[1] 2 2
[[3]]
[1] 1 1 2
[[4]]
[1] 1 1 1 1
If you would like to write a custom base R function, here is a recursive version
f <- function(n, vhead = n, v = c()) {
if (n == 0) return(list(v))
unlist(lapply(seq_len(min(n, vhead)), function(k) f(n - k, k, c(k,v))), recursive = FALSE)
}
then we can run
res <- Filter(function(x) length(x)>1,f(N))
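If you also want the second output format from the question, here is a small conversion sketch (assuming res and N as defined above): tabulate() counts how many times each part from 1 to N-1 appears, giving the multiplicity vectors v with sum(v * 1:(N-1)) == N.
v_list <- lapply(res, function(x) tabulate(x, nbins = N - 1)) # multiplicity vector per partition
v_list # for N = 4: c(4,0,0), c(2,1,0), c(0,2,0), c(1,0,1)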
I am trying to assign elements randomly to groups, but it seems that this is either surprisingly complicated or I am missing an obvious solution.
Remark: My attempt is written in R, but I am happy to receive any comprehensible answer in any language.
Input Data:
A vector o containing the available units of each object, where up to p-1 units of each object may be available. Example for o: (0, 1, 5, 8, p-1, ...)
Furthermore there is a vector g containing available spots in groups. Each group can take either p or p + 1 units. Example for g: (p, p + 1, p, p, p + 1, ...)
The following restriction applies to the input data:
All groups together have just enough spots for all available units: sum(o) == sum(g)
Expected Output
The available objects should be randomly distributed to the groups, but no group can hold more than one unit of each object. The order in which units are assigned to a group is not relevant.
Example
Example 1:
elements <- c(2, 1)
group_sizes <- c(1, 2)
The only possible solution is
(1), (1, 2)
Explanation: one unit of the first object in the first group; the second group contains one unit of each object.
Example 2:
elements <- c(2,1,1)
group_sizes <- c(1, 2, 1)
Possible solutions:
(1), (1, 2), (3)
(1), (1, 3), (2)
(1), (2, 3), (1)
(2), (1, 3), (1)
(3), (1, 2), (1)
I am looking for an algorithm returning for an given input one of the possible distributions. All distributions should have an equal chance to be returned.
**My unsuccessful approach**
I thought about assigning one unit to each group per iteration. Units cannot be assigned to groups that already contain a unit of the same object. The distribution order runs from the most common object to the least common object and then restarts with the most common object.
# Group sizes
group_sizes <- c(8,9,9,9,9,9,9,8,8)
# Available units of objects (already sorted by size)
object_units <- c(8,8,8,8,8,8,8,8,8,6)
# Generate a distributions order
dist_order <- numeric(0)
while(max(object_units) > 0){
dist_order <- c(dist_order, which(object_units > 0))
object_units <- object_units - 1
}
n_groups <- length(group_sizes)
result <- matrix(NA_real_, ncol = n_groups, nrow = min(group_sizes)) # assignments, filled in batch by batch
set.seed(1)
# distribute batch with
for(i in 1:min(group_sizes)){
# get the objects of which units are distributed in this batch
ind <- (i - 1) * n_groups + 1
current <- dist_order[ind:(ind + n_groups - 1)]
# beginning with the batch all groups can be distributed
avail_groups <- 1: n_groups
# iterate over the objects in this batch and select a group for each
for (j in 1:n_groups) {
# current object can only be assigned to where it has not been assigned yet
free <- which(colSums(result == current[j], na.rm = TRUE) == 0)
# current object can only be assigned to where no other object has been assigned to in this batch
current_avail_groups <- intersect(avail_groups, free)
# select from the remaining groups the one to assign the object to
if(length(current_avail_groups) == 1){
selected_group <- current_avail_groups
} else {
selected_group <- sample(current_avail_groups, size = 1)
}
# store assignment in result matrix
result[i, selected_group] <- current[j]
# remove group from available groups for this iteration
avail_groups <- avail_groups[selected_group != avail_groups]
}
}
This is not a valid algorithm, as it is possible that at some point, for some object, all groups either already contain a unit of that object or have already received a unit of another object within this iteration.
I'm trying the same code as in https://thiloshon.wordpress.com/2018/03/11/build-your-own-word-sentence-prediction-application-part-02/ to do word-level prediction. The input textual data is also in the mentioned link, and I use the en_US.news.txt file as my only input file.
library(quanteda)
library(data.table)
#read the .txt file
df=readLines('en_US.news.txt')
#take a sample of the df
sampleHolderNews <- sample(length(df), length(df) * 0.1)
US_News_Sample <- df[sampleHolderNews]
#build the corpus of the data
corp <- corpus(US_News_Sample)
#Preprocessing
master_Tokens <- tokens(x = tolower(corp),remove_punct =
TRUE,remove_numbers = TRUE,remove_hyphens = TRUE,remove_symbols = TRUE)
stemed_words <- tokens_wordstem(master_Tokens, language = "english")
#tokenization#
bi_gram <- tokens_ngrams(stemed_words, n = 2)
tri_gram <- tokens_ngrams(stemed_words, n = 3)
uni_DFM <- dfm(stemed_words)
bi_DFM <- dfm(bi_gram)
tri_DFM <- dfm(tri_gram)
uni_DFM <- dfm_trim(uni_DFM, 3)
bi_DFM <- dfm_trim(bi_DFM, 3)
tri_DFM <- dfm_trim(tri_DFM, 3)
sums_U <- colSums(uni_DFM)
sums_B <- colSums(bi_DFM)
sums_T <- colSums(tri_DFM)
# Create data tables with individual words as columns
uni_words <- data.table(word_1 = names(sums_U), count = sums_U)
bi_words <- data.table(
word_1 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 2),
count = sums_B)
tri_words <- data.table(
word_1 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 2),
word_3 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 3),
count = sums_T)
#indexing#
setkey(uni_words, word_1)
setkey(bi_words, word_1, word_2)
setkey(tri_words, word_1, word_2, word_3)
######## Finding Bi-Gram Probability #################
discount_value <- 0.75
# Finding number of bi-gram words
numOfBiGrams <- nrow(bi_words[.(word_1, word_2)])
# Dividing number of times word 2 occurs as second part of bigram, by total number of bigrams.
# Finding probability for a word given the number of times it was second word of a bigram
ckn <- bi_words[, .(Prob = ((.N) / numOfBiGrams)), by = word_2]
setkey(ckn, word_2)
# Assigning the probabilities as second word of bigram, to unigrams
uni_words[, Prob := ckn[word_1, Prob]]
uni_words <- uni_words[!is.na(uni_words$Prob)]
# Finding number of times word 1 occurred as word 1 of bi-grams
n1wi <- bi_words[, .(N = .N), by = word_1]
setkey(n1wi, word_1)
# Assigning total times word 1 occurred to bigram Cn1
bi_words[, Cn1 := uni_words[word_1, count]]
# Kneser-Ney algorithm
bi_words[, Prob := ((count - discount_value) / Cn1 + discount_value / Cn1 *
n1wi[word_1, N] * uni_words[word_2, Prob])]
######## End of Finding Bi-Gram Probability #################
######## Finding Tri-Gram Probability #################
# Finding count of word1-word2 combination in bigram
tri_words[, Cn2 := bi_words[.(word_1, word_2), .N]]
n1w12 <- tri_words[, .N, by = .(word_1, word_2)]
setkey(n1w12, word_1, word_2)
# Kneser-Ney algorithm
tri_words[, Prob := ((count - discount_value) / Cn2 + discount_value / Cn2 *
n1w12[.(word_1, word_2), .N] * bi_words[.(word_1, word_2), Prob])]
Here I get the following error from the Kneser-Ney step for trigrams:
Error in `[.data.table`(tri_words, , `:=`(Prob, ((count - discount_value)/Cn2 + :
Supplied 13867 items to be assigned to 3932 items of column 'Prob'. If you wish to 'recycle'
the RHS please use rep() to make this intent clear to readers of your code.
In addition: Warning messages:
1: In discount_value/Cn2 * n1w12[list(word_1, word_2), .N] * bi_words[list(word_1, :
longer object length is not a multiple of shorter object length
2: In (count - discount_value)/Cn2 + discount_value/Cn2 * n1w12[list(word_1, :
longer object length is not a multiple of shorter object length
I could find some similar questions related to data.table errors, but I can't understand how I should solve this error in my code.
The problem is in your attempt to multiply the quantities in the last line. This expression:
(count - discount_value) / Cn2 + discount_value / Cn2
is length 20, like tri_words. But the next expression
n1w12[.(word_1, word_2), .N]
is length 19. Then the last part,
bi_words[.(word_1, word_2), Prob]
is length 155 (and contains a lot of NAs).
The error messages are saying that the shorter item cannot be recycled into the longer item because the longer item's length is not a multiple of the length of the shorter item. To fix this, you need to implement this algorithm more carefully.
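One way to fix it, sketched here as an assumption about what is intended (each tri_words row should receive the count and bigram probability of its own (word_1, word_2) pair): use keyed update joins so every row gets exactly one matching value instead of relying on positional recycling. The column names N_w12 and Prob_bi below are made up for illustration.
# pull the pair count and the bigram probability onto each tri_words row
tri_words[n1w12, N_w12 := i.N, on = .(word_1, word_2)]
tri_words[bi_words, Prob_bi := i.Prob, on = .(word_1, word_2)] # pairs missing from bi_words stay NA
tri_words[, Prob := (count - discount_value) / Cn2 + discount_value / Cn2 * N_w12 * Prob_bi]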
I am trying to simulate the Chinese Restaurant Process in R, and I am wondering if I can make any efficiency improvements over this crude implementation.
iTables = 200 # number of tables
iSampleSize = 1000 # number of diners
# initialize the list of tables
listTableOccupants = vector('list', iTables)
for(currentDiner in seq.int(iSampleSize)) {
# occupation probabilities for the next diner
vProbabilities = sapply(listTableOccupants,
function(x) ifelse(!is.null(x),
length(x)/currentDiner,
1/currentDiner))
# pick the index of the lucky table
iTable = sample.int(iTables, size = 1, prob = vProbabilities)
# add to the list element corresponding to the table
listTableOccupants[[iTable]] =
c(listTableOccupants[[iTable]], currentDiner)
}
In particular, I am concerned about this line:
# add to the list element corresponding to the table
listTableOccupants[[iTable]] =
c(listTableOccupants[[iTable]], currentDiner)
Is this efficient?
To avoid space reallocation and sparse data structures, you can instead apply a table label to each diner. For example,
nDnr <- 100 # number of diners; must be at least 2
vDnrTbl <- rep(0, nDnr) # table label for each diner
alpha <- 2 # CRP parameter
vDnrTbl[1] <- 1
for (dnr in 2:length(vDnrTbl)) {
# compute occupation probabilities for current diner
vOcc <- table(vDnrTbl[1:(dnr-1)])
vProb <- c(vOcc, alpha) / (dnr - 1 + alpha)
# add table label to diner
nTbl <- as.numeric(names(vOcc)[length(vOcc)]) # avoid overhead of finding max of possibly large vector
vDnrTbl[dnr] <- sample.int(nTbl+1, size=1, prob=vProb)
}
From vDnrTbl, you can obtain listTableOccupants:
nTbl <- max(c(nTbl, vDnrTbl[dnr]))
listTableOccupants <- lapply(1:nTbl, function(t) which(vDnrTbl == t))
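As a quick usage check (my own addition, not part of the original answer), you can verify that every diner received a label and inspect the table sizes:
stopifnot(all(vDnrTbl > 0), sum(lengths(listTableOccupants)) == nDnr)
lengths(listTableOccupants) # number of diners at each table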