I have a large two-step optimization problem that I've tried to simplify for this question. The first step is to choose 10 elements to maximize utility subject to certain constraints. I need 200 of these sets, but due to the nature of what I'm trying to do, 600 need to be generated so that the right combinations can emerge.
Wrapping around these mini-optimization problems is a larger constraint where each individual element can only be used within a certain range. The first optimization tunes each element's utility so that each one is relatively close to the bounds, but it's not possible for all of them to be within their bounds. Therefore, the second step is to choose 200 of the 600 sets such that each individual element's min/max usage is satisfied. This is what I need help with.
I made a function using lpSolve that works, but over 80% of the time it freezes RStudio, and it's becoming too much of a hassle - I need to either improve my current approach or find a completely new one. I don't know if lpSolve is really the best tool to begin with. While I do have an overall set-score that I can maximize, all I really care about is having each element within its bounds. I've made a simplified example to get at the essence of my problem.
I'm in charge of making 200 meals from a set of 80 different fruits. Each meal uses 10 fruits and cannot have more than one of the same fruit. I'm limited in the number of each fruit that I have (and my boss is making me use a minimum of each fruit, otherwise they'll go bad), so usage needs to stay within certain bounds. I have a list of 600 meals already created (Meals), and each one has its own unique Health-Score. Ideally I would like to maximize the Health-Score, but obviously the most important piece is that each fruit is used the correct number of times; otherwise the meals can't be made in the first place.
Here's my code to 1) set up the 600 Meals (random), 2) set the min/max times each fruit must be used (random), and 3) run a linear optimization to select 200 of the 600 meals such that the individual fruit constraints are fulfilled. The program tries to choose 200 of the 600, but if the constraints don't allow it, then it loosens the constraints (e.g. if the solver doesn't work the first time, then I'll decrease the minimum number of times an Apple can be used and increase the maximum number of times it can be used). It does this one fruit at a time, rather than all at once. Eventually the constraints should be loosened so much that any 200 of the 600 will work (i.e. when every fruit's minPercent is less than 0 and every fruit's maxPercent is greater than 100), but it doesn't matter because R freezes up first.
library(stringr)
library(dplyr)
library(lpSolve)
# Inputs
MealsNeeded <- 200
Buffer <- 3
# Setup the meals (this is the output of another optimizer in my actual program. Considered "Step 1" as I mentioned above)
Meals <- data.frame()
for(i in 1:(MealsNeeded*Buffer)){
  run <- i
  meal <- sample(fruit, 10)
  healthFactor <- round(runif(1, 10, 30), 0) # (Health factor for the entire meal)
  df <- data.frame(Run = run, Fruit = meal, healthFactor = healthFactor, stringsAsFactors = FALSE)
  Meals <- rbind(Meals, df)
}
# The minimum/maximum number of times each fruit must be used across all 200 meals (these would be inputs in my program)
set.seed(11)
fruitDF <- data.frame(Name = fruit, minSelectPct = round(runif(length(fruit), .05, .1)*100, 0), stringsAsFactors = FALSE) %>%
  mutate(maxSelectPct = round(minSelectPct/2 + runif(length(fruit), .05, .1)*100, 0))
#### Actual Program Start
# Get objective
obj <- Meals %>%
  distinct(Run, healthFactor) %>%
  ungroup() %>%
  select(healthFactor) %>%
  pull()
# Dummy LU - for each fruit give 1/0 whether or not they were in the meal
dummyLUInd <- data.frame(FruitName = fruitDF$Name, stringsAsFactors = FALSE)
for(i in unique(Meals$Run)){
  selectedFruit <- Meals %>%
    filter(Run == i) %>%
    select(Fruit) %>%
    mutate(Indicator = 1)
  dummyLUIndTemp <- fruitDF %>%
    left_join(selectedFruit, by = c('Name' = 'Fruit')) %>%
    mutate(Indicator = ifelse(is.na(Indicator), 0, Indicator)) %>%
    select(Indicator)
  dummyLUInd <- cbind(dummyLUInd, dummyLUIndTemp)
}
## Table create
dummyLUInd <- rbind(dummyLUInd, dummyLUInd)[,-1]
dummyLUInd <- as.data.frame(t(dummyLUInd))
dummyLUInd$Total = 1
## Directions
dirLT <- c(rep('<=', (ncol(dummyLUInd)-1)/2))
dirGT <- c(rep('>=', (ncol(dummyLUInd)-1)/2))
## Multiply percentages by total Meals
MinExp = round(fruitDF$minSelectPct/100 * MealsNeeded - 0.499, 0)
MaxExp = round(fruitDF$maxSelectPct/100 * MealsNeeded + 0.499, 0)
# Setup constraints like # of tries
CounterMax <- 10000
LPSum = 0
Counter = 0
# Create DF to make it easier to change constraints for each run
MinExpDF <- data.frame(Place = 1:length(MinExp), MinExp = MinExp)
MaxExpDF <- data.frame(Place = 1:length(MaxExp), MaxExp = MaxExp)
cat('\nStarting\n')
Sys.sleep(2)
# Try to get the 200 of 600 Meals that satisfy the constraints for the individual Fruit.
# If the solution doesn't exist, loosen the constraints for each fruit (one at a time) until it does work
while (LPSum == 0 & Counter <= CounterMax) {
  rowUse <- Counter %% length(MaxExp)
  # Knock one off the minimum, starting with the highest exposure, one at a time
  MinExpDF <- MinExpDF %>%
    mutate(Rank = rank(-MinExp, na.last = FALSE, ties.method = "first"),
           MinExp = ifelse(Rank == rowUse, MinExp - 1, MinExp))
  MinExp <- MinExpDF$MinExp
  # Add one to the maximum, starting with the highest exposure, one at a time
  MaxExpDF <- MaxExpDF %>%
    mutate(Rank = rank(-MaxExp, na.last = FALSE, ties.method = "first"),
           MaxExp = ifelse(Rank == rowUse, MaxExp + 1, MaxExp))
  MaxExp <- MaxExpDF$MaxExp
  # Solve
  dir <- 'max'
  f.obj <- obj
  f.mat <- t(dummyLUInd)
  f.dir <- c(dirGT, dirLT, '==')
  f.rhs <- c(MinExp, MaxExp, MealsNeeded)
  Sol <- lp(dir, f.obj, f.mat, f.dir, f.rhs, all.bin = TRUE)$solution
  LPSum <- sum(Sol)
  Counter <- Counter + 1
  if(Counter %% 10 == 0) cat(Counter, ', ', sep = '')
}
# Get the Run #'s from the lpSolve
if(Counter >= CounterMax){
  cat("Unable to find right exposure, returning all Meals\n")
  MealsSolved <- Meals
} else {
  MealsSolved <- data.frame(Run = unique(Meals$Run))
  MealsSolved$selected <- Sol
  MealsSolved <- MealsSolved[MealsSolved$selected == 1,]
}
# Final Meals
FinalMeals <- Meals %>%
  filter(Run %in% MealsSolved$Run)
If you run this code enough times, eventually RStudio will freeze up on you (at least it does for me; if it doesn't for you, I suppose you could increase the number of Meals). It happens during the actual lp call, so there's really not much you can do from R since the work happens in C code. This is where I'm lost.
Part of me thinks this isn't really an lpSolve issue, since I'm not really trying to maximize anything (the Health-Factor isn't all that important). My real "loss function" is the number of times each fruit goes above/below its min/max exposure, but I can't think of how to set something like this up cleanly. Can my current approach work, or do I need to do something completely different?
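The rough direction I have in mind (an untested sketch building on the objects above; A, conMin, conMax and the slack variables are names I made up) would be to let slack variables absorb any violation and minimize their total, so the LP is always feasible and the loosening loop isn't needed - but I don't know whether this is sound:
# Untested sketch: penalize bound violations with slack variables instead of
# loosening the bounds in a loop. All names here are placeholders.
nMeals <- length(unique(Meals$Run))   # 600 candidate meals
nFruit <- nrow(fruitDF)               # 80 fruits
# Fruit-by-meal indicator matrix (rows ordered like fruitDF, columns like Run 1..600)
A <- table(factor(Meals$Fruit, levels = fruitDF$Name), Meals$Run)
A <- matrix(as.numeric(A), nrow = nFruit)
# Decision vector: nMeals binary picks, then nFruit shortfall and nFruit overage slacks
f.obj <- c(rep(0, nMeals), rep(1, 2 * nFruit))                   # minimize total violation
conMin <- cbind(A, diag(nFruit), matrix(0, nFruit, nFruit))      # count + shortfall >= MinExp
conMax <- cbind(A, matrix(0, nFruit, nFruit), -diag(nFruit))     # count - overage  <= MaxExp
conTot <- c(rep(1, nMeals), rep(0, 2 * nFruit))                  # pick exactly MealsNeeded meals
f.mat <- rbind(conMin, conMax, conTot)
f.dir <- c(rep('>=', nFruit), rep('<=', nFruit), '==')
f.rhs <- c(MinExp, MaxExp, MealsNeeded)
Sol <- lp('min', f.obj, f.mat, f.dir, f.rhs, binary.vec = 1:nMeals)
picked <- which(Sol$solution[1:nMeals] > 0.5)   # Run numbers of the selected meals
# (healthFactor could be folded in as a small secondary term on the meal picks)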
I'm looking to make a set of two random numbers (e.g., [1,2], [3,12]) with the first number between 1-12, and the second between 1-4. I know how to sample the two numbers independently using:
sample(1:12, 1, replace = T)
sample(1:4, 1, replace = T)
but don't know how to create a system to determine if the pairing of the two numbers has already been rolled, and if so, roll again. Any tips!?
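For what it's worth, the kind of thing I'm picturing is a re-roll loop like the rough sketch below (roll_new_pair and rolled_pairs are just names I made up), though I don't know whether this is a sensible way to do it:
# Rough sketch of the "roll again if already seen" idea (placeholder names)
rolled_pairs <- character(0)
roll_new_pair <- function() {
  repeat {
    pair <- c(sample(1:12, 1), sample(1:4, 1))
    key <- paste(pair, collapse = "-")
    if (!key %in% rolled_pairs) {
      rolled_pairs <<- c(rolled_pairs, key)   # remember this pair globally
      return(pair)
    }
  }
}
roll_new_pair()   # returns a pair that hasn't been rolled before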
Thanks :)
While this doesn't scale happily (in case you need large-scale simulation), you can do this:
set.seed(42)
di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size = 1)), size = 1)
c(di1, di2)
# [1] 1 2
The inner (di1) assignment draws the first number from 1:12; so far so good.
We then setdiff di1 out of 1:4 so that the second sampling only has candidates that are not equal to di1;
the outer (di2) assignment then samples from 1:4 without di1 if it happened to fall within 1-4.
While not an authoritative proof of correctness,
rand <- replicate(100000, local({ di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size=1)), size = 1); c(di1, di2); }))
dim(rand)
# [1] 2 100000
any(rand[1,] == rand[2,])
# [1] FALSE
Are you looking for something like:
library(tidyverse)
expand.grid(1:12, 1:4) %>%
  as.data.frame() %>%
  slice_sample(n = 5, replace = FALSE)
Just for fun and to practice R, I tried to prove the Monty Hall rule (changing your choice after one gate is opened gives you a higher probability of winning). I made this reproducible code (the explanation of every step is within the code):
## First I set the seed
set.seed(4)
## Then I model the presence of the prize as a random variable over gates 1, 2, 3
randomgates <- ceiling(runif(10000, min = 0, max = 3))
## I do the same with the random choice.
randomchoice <- ceiling(runif(10000, min = 0, max = 3))
## As the opening of a gate depends on the gate you chose (the gate you chose cannot be opened),
## I model the opening of the gate as a variable which cannot be equal to the choice.
options <- c(1:3)
randomopen <- rep(1, 10000)
for (i in 1:length(randomgates)) {
  realoptions <- options[options != randomchoice[i]]
  randomopen[i] <- realoptions[ceiling(runif(1, min = 0, max = 2))]
}
## Just to make the data easier to handle, I make a dataset
dataset <- cbind(randomgates, randomchoice, randomopen)
## Then I create a dataset which only keeps the realizations of the games in which we carry on
## (the opened gate wasn't the one with the prize within)
steptwo <- dataset[randomopen != randomgates,]
## The next step is just to check if the probability of carry on is 2/3, which indeed is
carryon <- randomopen != randomgates
sum(carryon)/length(randomgates)
## I format the dataset as a data frame
steptwo <- as.data.frame(steptwo)
## Now we check what happens if we hold our initial choice when game carries on
prizesholding <- steptwo$randomgates == steptwo$randomchoice
sum(prizesholding)
## creating a vector of the changed option, dependent on the opened gate, in the dataset that
## keeps only the cases in which we carried on playing (the opened gate wasn't the one with the prize)
switchedchoice <- rep(1, length(steptwo$randomgates))
for (i in 1:length(steptwo$randomgates)) {
  choice <- options[options != steptwo$randomchoice[i]]
  switchedchoice[i] <- choice[ceiling(runif(1, min = 0, max = 2))]
}
## Now we check how many times you guess the prize gate when you switch your initial choice
prizesswitching <- steptwo$randomgates == switchedchoice
sum(prizesswitching)/length(steptwo$randomgates)
When I check the probability without changing my initial choice in the cases in which the game carried on (the opened gate didn't match the one with the prize), I obtain what I expected (close to a 1/3 probability of winning the prize), which corresponds to the following instruction:
carryon <- randomopen != randomgates
sum(carryon)/length(randomgates)
My problem arises when I check the probability of winning the prize after changing my choice (conditioned, obviously, on not having opened the door that holds the prize): instead of getting 1/2 as Monty Hall states, I get 1/4. It corresponds to the following instruction:
prizesswitching <- steptwo$randomgates == switchedchoice
sum(prizesswitching)/length(steptwo$randomgates)
I know that I am doing something wrong, because it has been proven many times over that the Monty Hall result holds, but I am not able to detect the flaw. Does anyone know what it is?
If you don't know what the Monty Hall problem is, you can find easy-to-read information on Wikipedia:
Monty Hall Game
Edit: As @Dason pointed out, one of the problems was that I was introducing some kind of randomness into the changing of the initial choice, which doesn't make sense as there is only one option left.
The other problem was that I was not approaching the problem under the assumption that Monty Hall knows where the prize is. I changed my code from the initial version to this, and the problem is solved:
# Prepare each variable for 10000 experiments
## First I set the seed
set.seed(4)
## Then I model the presence of the prize as a random variable over gates 1, 2, 3
randomgates <- ceiling(runif(10000, min = 0, max = 3))
## I do the same with the random choice.
randomchoice <- ceiling(runif(10000, min = 0, max = 3))
## As the opening of a gate depends on the gate you chose (the gate you chose cannot be opened,
## nor can the one with the prize), I model the opening of the gate as a variable which cannot be equal to either.
options <- c(1:3)
randomopen <- rep(1, 10000)
for (i in 1:length(randomgates)) {
  randomopen[i] <- options[options != randomchoice[i] & options != randomgates[i]]
}
## Just to make the data easier to handle, I make a dataset
dataset <- cbind(randomgates, randomchoice, randomopen)
## I format the dataset as a data frame
steptwo <- as.data.frame(dataset)
## Now we check what happens if we hold our initial choice when game carries on
steptwo$prizesholding <- steptwo$randomgates == steptwo$randomchoice
with(steptwo, sum(prizesholding))
## creating a vector of the switched choice, dependent on the opened gate
steptwo$switchedchoice <- rep(1, length(steptwo$randomgates))
for (i in 1:length(steptwo$randomgates)) {
  steptwo$switchedchoice[i] <- options[options != steptwo$randomchoice[i] & options != steptwo$randomopen[i]]
}
## Now we check how many times you guess the prize gate when you switch your initial choice
steptwo$prizesswitching <- steptwo$randomgates == steptwo$switchedchoice
with(steptwo, sum(prizesswitching)/length(randomgates))
Each round, there is a prize_door and a chosen_door. Monty Hall will open a door that is not a prize_door or chosen_door (setdiff between 1:3 and the vector (prize_door, chosen_door), with a random choice between the two if the setdiff is two elements). Then the switch door is the door not chosen or opened.
n <- 1e4
set.seed(2020)
df <-
  data.frame(
    prize_door = sample(1:3, n, replace = TRUE),
    chosen_door = sample(1:3, n, replace = TRUE))
df$opened_door <-
  mapply(function(x, y){
    available <- setdiff(1:3, c(x, y))
    available[sample(length(available), 1)]
  }, df$prize_door, df$chosen_door)
df$switch_door <-
  mapply(function(x, y) setdiff(1:3, c(x, y)),
         df$chosen_door, df$opened_door)
with(df, mean(prize_door == chosen_door))
# [1] 0.3358
with(df, mean(prize_door == switch_door))
# [1] 0.6642
Plot of probabilities as n increases
probs <-
  data.frame(
    chosen_p = with(df, cumsum(prize_door == chosen_door))/(1:n),
    switch_p = with(df, cumsum(prize_door == switch_door))/(1:n))
plot(probs$switch_p, type = 'l', ylim = c(0, 1))
lines(probs$chosen_p, col = 'red')
abline(h = 1/3)
abline(h = 2/3)
This seems to do the trick:
n_iter <- 10000
set.seed(4)
doors <- 1:3
prizes <- sample.int(n = 3, size = n_iter, replace = TRUE)
your_pick <- sample.int(n = 3, size = n_iter, replace = TRUE)
open_door <- rep(0, n_iter)
switched_door <- rep(0, n_iter)
for (i in 1:n_iter) {
  remaining_choices <- setdiff(doors, c(your_pick[i], prizes[i]))
  if (length(remaining_choices) > 1) {
    open_door[i] <- sample(remaining_choices, size = 1)
  } else {
    open_door[i] <- remaining_choices
  }
  switched_door[i] <- setdiff(doors, c(your_pick[i], open_door[i]))
}
> mean(your_pick == prizes)
[1] 0.3305
> mean(switched_door == prizes)
[1] 0.6695
The sample.int and sample base functions help simplify things a bit. The remaining_choices item contains the possible doors that can be opened by the game show host, which has a length of 1 or 2 depending on your original choice. If the length is 2, we sample from that vector, and if it's 1, that door is automatically opened.
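To see why the length-1 case needs its own branch, here is a quick illustration (a toy example of mine, not part of the simulation) of how sample() treats a single number:
# sample() on a single number n draws from 1:n rather than returning n itself
sample(c(2, 3), 1)   # one of 2 or 3, as intended
sample(3, 1)         # draws from 1:3, so it could return 1 or 2 as well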
I am looking at a set of numbers and aiming to split them into subsets via set partitioning. The deciding factor in how these subsets are generated is ensuring that the sum of all the elements in each subset is as close as possible to a number generated by a pre-determined distribution. The subsets need not be the same size, and each element can only be in one subset. I had previously been given guidance on this problem via the greedy algorithm (Link here), but I have found that some of the larger numbers in the set drastically skewed the results. I would therefore like to use some form of set partitioning for this problem.
A deeper underlying issue, which I would really like to correct for future problems, is that I find I am drawn to the “brute force” approach with this type of problem (as you can see from my code below, which attempts to use folds to solve the problem via “brute force”). This is obviously a completely inefficient way to tackle the problem, so I would like to approach these minimization-type problems more intelligently going forward. Therefore any advice is greatly appreciated.
library(groupdata2)
library(dplyr)
set.seed(345)
j <- runif(500,0,10000000)
dist <- c(.3,.2,.1,.05,.065,.185,.1)
s_diff <- 9999999999
for (i in 1:100) {
  x <- fold(j, k = length(dist), method = "n_rand")
  if (abs(sum(j) * dist[1] - sum(j[which(x$.folds==1)])) < abs(s_diff)) {
    s_diff <- abs(sum(j) * dist[1] - sum(j[which(x$.folds==1)]))
    x_fin <- x
  }
}
This is just a simplified version looking only at the first ‘subset’. s_diff would be the smallest difference between the theoretical and simulated results, and x_fin would be which subset each element falls into (i.e. which fold it corresponds to). I was then looking to remove the elements that fell into the first subset and continue from there, but I know my method is inefficient.
Thanks in advance!
This is not a trivial problem, as you will probably gather from the complete lack of answers at 10 days, even with a bounty. As it happens, I think it is a great problem for thinking about algorithms and optimizations, so thanks for posting.
The first thing I would note is that you are absolutely right that this is not the kind of problem with which to try brute force. You may get near to a correct answer, but with a non-trivial number of samples and distribution points, you won't find the optimum solution. You need an iterative approach that moves elements about only if they make the fit better, and the algorithm needs to stop when it can't make it any better.
My approach here is to split the problem into three stages:
1. Cut the data into approximately the correct bins as a first approximation.
2. Move elements from the bins that are a bit too big to the ones that are a bit too small. Do this iteratively until no more moves will optimize the bins.
3. Swap the elements between columns to fine tune the fit, until the swaps are optimal.
The reason to do it in this order is that each step is computationally more expensive, so you want to pass a better approximation to each step before letting it do its thing.
Let's start with a function to cut the data into approximately the correct bins:
cut_elements <- function(j, dist)
{
  # Specify the sums that we want to achieve in each partition
  partition_sizes <- dist * sum(j)
  # The cumulative partition sizes give us our initial cuts
  partitions <- cut(cumsum(j), cumsum(c(0, partition_sizes)))
  # Name our partitions according to the given distribution
  levels(partitions) <- levels(cut(seq(0, 1, 0.001), cumsum(c(0, dist))))
  # Return our partitioned data as a data frame.
  data.frame(data = j, group = partitions)
}
We want a way to assess how close this approximation (and subsequent approximations) are to our answer. We can plot against the target distribution, but it will also be helpful to have a numerical figure to assess the goodness of fit to include on our plot. Here, I will use the sum of the squares of the differences between the sample bins and the target bins. We'll use the log to make the numbers more comparable. The lower the number, the better the fit.
library(dplyr)
library(ggplot2)
library(tidyr)
compare_to_distribution <- function(df, dist, title = "Comparison")
{
  df %>%
    group_by(group) %>%
    summarise(estimate = sum(data)/sum(j)) %>%
    mutate(group = factor(cumsum(dist))) %>%
    mutate(target = dist) %>%
    pivot_longer(cols = c(estimate, target)) ->
    plot_info
  log_ss <- log(sum((plot_info$value[plot_info$name == "estimate"] -
                     plot_info$value[plot_info$name == "target"])^2))
  ggplot(data = plot_info, aes(x = group, y = value, fill = name)) +
    geom_col(position = "dodge") +
    labs(title = paste(title, ": log sum of squares =", round(log_ss, 2)))
}
So now we can do:
cut_elements(j, dist) %>% compare_to_distribution(dist, title = "Cuts only")
We can see that the fit is already pretty good with a simple cut of the data, but we can do a lot better by moving appropriately sized elements from the over-sized bins to the under-sized bins. We do this iteratively until no more moves will improve our fit. We use two nested while loops, which should make us worry about computation time, but we have started with a close match, so we shouldn't get too many moves before the loop stops:
move_elements <- function(df, dist)
{
  ignore_max = length(dist)
  while(ignore_max > 0)
  {
    ignore_min = 1
    match_found = FALSE
    while(ignore_min < ignore_max)
    {
      group_diffs <- sort(tapply(df$data, df$group, sum) - dist*sum(df$data))
      group_diffs <- group_diffs[ignore_min:ignore_max]
      too_big <- which.max(group_diffs)
      too_small <- which.min(group_diffs)
      swap_size <- (group_diffs[too_big] - group_diffs[too_small])/2
      which_big <- which(df$group == names(too_big))
      candidate_row <- which_big[which.min(abs(swap_size - df[which_big, 1]))]
      if(df$data[candidate_row] < 2 * swap_size)
      {
        df$group[candidate_row] <- names(too_small)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else
      {
        ignore_min <- ignore_min + 1
      }
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}
Let's see what that has done:
cut_elements(j, dist) %>%
  move_elements(dist) %>%
  compare_to_distribution(dist, title = "Cuts and moves")
You can see now that the match is so close we are struggling to see whether there is any difference between the target and the partitioned data. That's why we needed the numerical measure of GOF.
Still, let's get this fit as good as possible by swapping elements between columns to fine-tune them. This step is computationally expensive, but again we are already giving it a close approximation, so it shouldn't have much to do:
swap_elements <- function(df, dist)
{
  ignore_max = length(dist)
  while(ignore_max > 0)
  {
    ignore_min = 1
    match_found = FALSE
    while(ignore_min < ignore_max)
    {
      group_diffs <- sort(tapply(df$data, df$group, sum) - dist*sum(df$data))
      too_big <- which.max(group_diffs)
      too_small <- which.min(group_diffs)
      current_excess <- group_diffs[too_big]
      current_defic <- group_diffs[too_small]
      current_ss <- current_excess^2 + current_defic^2
      all_pairs <- expand.grid(df$data[df$group == names(too_big)],
                               df$data[df$group == names(too_small)])
      all_pairs$diff <- all_pairs[,1] - all_pairs[,2]
      all_pairs$resultant_big <- current_excess - all_pairs$diff
      all_pairs$resultant_small <- current_defic + all_pairs$diff
      all_pairs$sum_sq <- all_pairs$resultant_big^2 + all_pairs$resultant_small^2
      improvements <- which(all_pairs$sum_sq < current_ss)
      if(length(improvements) > 0)
      {
        swap_this <- improvements[which.min(all_pairs$sum_sq[improvements])]
        r1 <- which(df$data == all_pairs[swap_this, 1] & df$group == names(too_big))[1]
        r2 <- which(df$data == all_pairs[swap_this, 2] & df$group == names(too_small))[1]
        df$group[r1] <- names(too_small)
        df$group[r2] <- names(too_big)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else ignore_min <- ignore_min + 1
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}
Let's see what that has done:
df <- cut_elements(j, dist) %>%
  move_elements(dist) %>%
  swap_elements(dist)
df %>% compare_to_distribution(dist, title = "Cuts, moves and swaps")
Pretty close to identical. Let's quantify that:
tapply(df$data, df$group, sum)/sum(j)
# (0,0.3] (0.3,0.5] (0.5,0.6] (0.6,0.65] (0.65,0.715] (0.715,0.9]
# 0.30000025 0.20000011 0.10000014 0.05000010 0.06499946 0.18500025
# (0.9,1]
# 0.09999969
So, we have an exceptionally close match: each partition is less than one part in one million away from the target distribution. Quite impressive considering we only had 500 measurements to put into 7 bins.
In terms of retrieving your data, we haven't touched the ordering of j within the data frame df:
all(df$data == j)
# [1] TRUE
and the partitions are all contained within df$group. So if we want a single function to return just the partitions of j given dist, we can just do:
partition_to_distribution <- function(data, distribution)
{
  cut_elements(data, distribution) %>%
    move_elements(distribution) %>%
    swap_elements(distribution) %>%
    `[`(,2)
}
In conclusion, we have an algorithm that creates an exceptionally close match. However, that's no good if it takes too long to run. Let's test it out:
microbenchmark::microbenchmark(partition_to_distribution(j, dist), times = 100)
# Unit: milliseconds
# expr min lq mean median uq
# partition_to_distribution(j, dist) 47.23613 47.56924 49.95605 47.78841 52.60657
# max neval
# 93.00016 100
Only 50 milliseconds to fit 500 samples. Seems good enough for most applications. It would grow exponentially with larger samples (about 10 seconds on my PC for 10,000 samples), but by that point the relative fineness of the samples means that cut_elements %>% move_elements already gives you a log sum of squares of below -30 and would therefore be an exceptionally good match without the fine tuning of swap_elements. These would only take about 30 ms with 10,000 samples.
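For illustration, here is a sketch of that two-stage pipeline on a larger sample (the data below is my own made-up example, generated the same way as j; the fit measure is computed directly rather than via compare_to_distribution, which divides by the global j):
# Sketch on 10,000 samples: the two cheap steps alone already fit very closely
set.seed(123)
j_big <- runif(10000, 0, 10000000)
fit <- cut_elements(j_big, dist) %>% move_elements(dist)
# log sum of squares of the differences from the target distribution
log(sum((tapply(fit$data, fit$group, sum) / sum(j_big) - dist)^2))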
To add to the excellent answer by @AllanCameron, here is a solution that utilizes the highly efficient function comboGeneral from RcppAlgos*.
library(RcppAlgos)
partDist <- function(v, d, tol_ratio = 0.0001) {
  tot_sum <- d * sum(v)
  orig_len <- length(v)
  tot_len <- d * orig_len
  df <- do.call(rbind, lapply(1L:(length(d) - 1L), function(i) {
    len <- as.integer(tot_len[i])
    vals <- comboGeneral(v, len,
                         constraintFun = "sum",
                         comparisonFun = "==",
                         limitConstraints = tot_sum[i],
                         tolerance = tol_ratio * tot_sum[i],
                         upper = 1)
    ind <- match(vals, v)
    v <<- v[-ind]
    data.frame(data = as.vector(vals), group = rep(paste0("g", i), len))
  }))
  len <- orig_len - nrow(df)
  rbind(df, data.frame(data = v,
                       group = rep(paste0("g", length(d)), len)))
}
The idea is that we find a subset of v (e.g. j in the OP's case) such that the sum is within a tolerance of sum(v) * d[i] for some index i (d is equivalent to dist in the OP's example). After we find a solution (N.B. we are putting a cap on the number of solutions by setting upper = 1), we assign it to a group and then remove those elements from v. We then iterate until we are left with just enough elements in v to be assigned to the last distribution value (i.e. dist[length(dist)]).
Here is an example using the OP's data:
set.seed(345)
j <- runif(500,0,10000000)
dist <- c(.3,.2,.1,.05,.065,.185,.1)
system.time(df_op <- partDist(j, dist, 0.0000001))
user system elapsed
0.019 0.000 0.019
And using the plotting function by @AllanCameron we have:
df_op %>% compare_to_distribution(dist, "RcppAlgos OP Ex")
What about a larger sample with the same distribution:
set.seed(123)
j <- runif(10000,0,10000000)
## N.B. Very small ratio
system.time(df_huge <- partDist(j, dist, 0.000000001))
user system elapsed
0.070 0.000 0.071
The results:
df_huge %>% compare_to_distribution(dist, "RcppAlgos Large Ex")
As you can see, the solution scales very well. We can speed up execution by loosening tol_ratio at the expense of the quality of the result.
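For example (an illustrative tolerance value of my choosing, not benchmarked):
## Illustrative only: a looser tolerance accepts the first subset whose sum is
## within 0.1% of the target, which is cheaper to find but fits less tightly
system.time(df_loose <- partDist(j, dist, tol_ratio = 0.001))
df_loose %>% compare_to_distribution(dist, "RcppAlgos, looser tolerance")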
For reference, with the large data set the solution given by @AllanCameron takes just under 3 seconds and gives a similar log sum of squares value (~44):
system.time(allan_large <- partition_to_distribution(j, dist))
user system elapsed
2.261 0.675 2.938
* I am the author of RcppAlgos
I am working through Digital Dice by Paul Nahin to teach myself Monte Carlo simulations. I am converting the Matlab code in the book to R code on the first pass, then replacing for-loops with tidy versions on the second pass.
Edit: Here is what I am looking to model:
Imagine that you face a pop quiz: a list of the 24 Presidents of the 19th century and another list of their terms in office, but scrambled.
The object is to match the President with the term.
You get one guess every time.
On average, how many do you guess correctly?
Here is the R code using for-loops:
m <- 24
total_correct <- 0
n <- 10000
for (i in 1:n) {
  correct <- 0
  term <- sample(m, replace = TRUE)
  for (j in 1:m) {
    if (term[j] == j) {
      correct <- correct + 1
    }
  }
  total_correct <- total_correct + correct
}
total_correct <- total_correct / n
print(total_correct)
This works (but I admit it gives the wrong answer). Next is to tidy-fy this -- here is my attempt:
crossing(trials = 1:10,
         m = 1:24) %>%
  mutate(guess = sample.int(24, n(), replace = F), result = m == guess) %>%
  summarise(score = sum(result) / n())
However, I get an error message reading
Error in sample.int(x, size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
I understand what's going on: the n() command in the mutate() statement returns 240. Sampling 240 from a population of 24 with replace = FALSE is nonsensical, hence the error message.
How do I get the mutate() statement to receive a size of 24 on each iteration (or trial)?
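My guess is that grouping by trial is the missing piece - something like the sketch below (which mirrors n = 10000 from the loop version) - but I'm not sure this is the idiomatic tidy way:
library(tidyverse)
# Tentative sketch: sample a fresh permutation of 1:24 within each trial,
# so sample() only ever sees 24 positions at a time
set.seed(1)
crossing(trials = 1:10000, m = 1:24) %>%
  group_by(trials) %>%
  mutate(guess = sample(24), result = m == guess) %>%
  summarise(correct = sum(result), .groups = "drop") %>%
  summarise(avg_correct = mean(correct))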
I recently asked a question about improving performance in my code (Faster method than "while" loop to find chain of infection in R).
Background:
I'm analyzing large tables (300,000 - 500,000 rows) that store data output by a disease simulation model. In the model, animals on a landscape infect other animals. In the example pictured below, animal a1 infects every animal on the landscape, and the infection moves from animal to animal, branching off into "chains" of infection.
In my original question, I asked how I could output a data.frame corresponding to animal "d2"'s chain of infection (see below, outlined in green, for an illustration of one "chain"). The suggested solution worked well for one animal.
In reality, I will need to calculate chains for about 400 animals, corresponding to a subset of all animals (allanimals table).
I've included a link to an example dataset that is large enough to play with.
Here is the code for one chain, starting with animal 5497370. Note that I've slightly changed the column names from my previous question and updated the code!
The code:
allanimals <- read.csv("https://www.dropbox.com/s/0o6w29lz8yzryau/allanimals.csv?raw=1",
                       stringsAsFactors = FALSE)
# Here's an example animal
ExampleAnimal <- 5497370
ptm <- proc.time()
allanimals_ID <- setdiff(unique(c(allanimals$ID, allanimals$InfectingAnimal_ID)), -1)
infected <- rep(NA_integer_, length(allanimals_ID))
infected[match(allanimals$ID, allanimals_ID)] <-
  match(allanimals$InfectingAnimal_ID, allanimals_ID)
path <- rep(NA_integer_, length(allanimals_ID))
curOne <- match(ExampleAnimal, allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
  path[i] <- curOne
  i <- i + 1
  curOne <- nextOne
}
chain <- allanimals[path[seq_len(i - 1)], ]
chain
proc.time() - ptm
# check it out
chain
I'd like to output chains for each animal in "sel.set":
sel.set <- allanimals %>%
  filter(HexRow < 4 & Year == 130) %>%
  pull("ID")
If possible, I'd like to store each "chain" data.frame as list with length = number of chains.
So I'll return the indices used to access the data frame rather than all the data frame subsets. You'll just need to use lapply(test, function(path) allanimals[path, ]), or a more complicated function inside the lapply if you want to do other things with the data frame subsets.
One could think of just using lapply on the solution for one animal:
get_path <- function(animal) {
  curOne <- match(animal, allanimals_ID)
  i <- 1
  while (!is.na(nextOne <- infected[curOne])) {
    path[i] <- curOne
    i <- i + 1
    curOne <- nextOne
  }
  path[seq_len(i - 1)]
}
sel.set <- allanimals %>%
  filter(HexRow < 4 & Year == 130) %>%
  pull("ID")
system.time(
  test <- lapply(sel.set, get_path)
) # 0.66 seconds
We could rewrite this function as a recursive function (this will introduce my third and last solution).
system.time(
  sel.set.match <- match(sel.set, allanimals_ID)
) # 0
get_path_rec <- function(animal.match) {
  `if`(is.na(nextOne <- infected[animal.match]),
       NULL,
       c(animal.match, get_path_rec(nextOne)))
}
system.time(
  test2 <- lapply(sel.set.match, get_path_rec)
) # 0.06
all.equal(test2, test) # TRUE
This solution is 10 times as fast. I don't understand why though.
Why did I want to write a recursive function? I thought you might have a lot of cases where you want, for example, to get the paths of animalX and animalY where animalY infected animalX. When computing the path of animalX, you would then recompute the whole path of animalY.
So I wanted to use memoization to store already computed results, and memoization works well with recursive functions. So here is my last solution:
get_path_rec_memo <- memoise::memoize(get_path_rec)
memoise::forget(get_path_rec_memo)
system.time(
  test3 <- lapply(sel.set.match, get_path_rec_memo)
) # 0.12
all.equal(test3, test) # TRUE
Unfortunately, this is slower than the second solution. Hope it will be useful for the whole dataset.
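And to get back to the list of per-animal data frames asked for in the question, a minimal wrap-up sketch (chains is just my name for the result):
# Convert the index paths from the fast solution into the requested list of chains
chains <- lapply(test2, function(path) allanimals[path, ])
names(chains) <- sel.set
length(chains)   # one chain data frame per animal in sel.set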