Using sample(..., replace = FALSE) in a tidy Monte Carlo simulation using crossing - r

I am working through Digital Dice by Paul Nahin to teach myself Monte Carlo simulations. I am converting the Matlab code in the book to R code on the first pass, then replacing for-loops with tidy versions on the second pass.
Edit: Here is what I am looking to model:
Imagine that you face a pop quiz: a list of the 24 Presidents of the 19th century and another list of their terms in office, but scrambled.
The object is to match each President with his term.
You get one guess for each President.
On average, how many do you guess correctly?
Here is the R code using for-loops:
m <- 24
total_correct <- 0
n <- 10000

for (i in 1:n) {
  correct <- 0
  term <- sample(m, replace = TRUE)
  for (j in 1:m) {
    if (term[j] == j) {
      correct <- correct + 1
    }
  }
  total_correct <- total_correct + correct
}

total_correct <- total_correct / n
print(total_correct)
This works (though I admit it gives the wrong answer). The next step is to tidy-fy it; this is my attempt:
crossing(trials = 1:10,
         m = 1:24) %>%
  mutate(guess = sample.int(24, n(), replace = FALSE),
         result = m == guess) %>%
  summarise(score = sum(result) / n())
However, I get an error message reading
Error in sample.int(x, size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
I understand what's going on: the n() command in the mutate() statement returns 240. Sampling 240 values from a population of 24 with replace = FALSE is nonsensical, hence the error message.
How do I get the mutate() statement to receive a size of 24 on each iteration (or trial)?
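(One possible direction, sketched here with illustrative column names: group by trials so that each call to sample() sees only that trial's 24 rows, i.e. one full permutation per trial.)

library(tidyverse)

n_trials <- 10000
m <- 24

crossing(trials = 1:n_trials, president = 1:m) %>%
  group_by(trials) %>%
  mutate(guess = sample(m)) %>%                    # one permutation of 1:m per trial, no replacement
  summarise(correct = sum(president == guess)) %>% # matches per trial
  summarise(average_correct = mean(correct))
# The long-run average should be close to 1, the expected number of
# fixed points in a random permutation.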


While loop within function stops even when conditions not met

I'm a bit stumped. I'm attempting to write code that runs Monte Carlo simulations of increasing sample sizes until certain conditions are met. First off, the bit of code that I know does work:
##Step 0 - load packages##
library(tidyverse)
library(ggplot2)
library(ggthemes)
##Step 1 - Define number of cycles per simulation##
ncycles <- 250000

##Step 2 - Define function for generating volumes and checking proportion of failed cycles##
volSim <- function(ncycles){
  tols <- rnorm(ncycles, 0, 0.3)          #Generate n unique tolerances
  vols <- 0                               #Establish vols variable within function
  for (tol in 2:ncycles){                 #for loop creates n unique volumes from tolerances
    vols[tol] <- 2.2 + tols[tol] - tols[tol - 1]
  }
  cell <- rnorm(1, 3.398864, 0.4810948)   #Generate a unique threshold
  return(c(mean(vols > cell), mean(vols > cell*2), mean(vols > cell*20))) #Output a vector of failure rate
}
This works fine and outputs three values, the proportion of events over each multiple of the threshold. Now, for the bit that's not behaving:
##Step 3 - Define a function to run multiple iterations of simulation and check convergence##
regres <- function(ncycles){
  #Establish parameters used within function#
  converged <- FALSE
  fail_rate_5k <- 0
  se_5k <- 0
  ncells <- 0
  fail_rate_10k <- 0
  se_10k <- 0
  fail_rate_100k <- 0
  se_100k <- 0
  n <- 0
  while ((converged == FALSE & n < 6) | n < 4){
    n <- n + 1
    res <- replicate(2^(n + 5), volSim(ncycles))
    fail_rate_5k[n] <- mean(res[1, ] > 0)
    se_5k[n] <- sqrt(fail_rate_5k[n] * (1 - fail_rate_5k[n]) / 2^(n + 5))
    ncells[n] <- 2^(n + 5)
    fail_rate_10k[n] <- mean(res[2, ] > 0)
    se_10k[n] <- sqrt(fail_rate_10k[n] * (1 - fail_rate_10k[n]) / 2^(n + 5))
    fail_rate_100k[n] <- mean(res[3, ] > 0)
    se_100k[n] <- sqrt(fail_rate_100k[n] * (1 - fail_rate_100k[n]) / 2^(n + 5))
    if((fail_rate_5k[n] <= 0 | se_5k[n] < 0.5 * fail_rate_5k[n]) &
       (fail_rate_10k[n] <= 0 | se_10k[n] < 0.5 * fail_rate_10k[n]) &
       (fail_rate_100k[n] <= 0 | se_100k[n] < 0.5 * fail_rate_100k[n])){
      converged <- TRUE
    } else {
      converged <- FALSE
    }
    return(data.frame(k5 = fail_rate_5k, se_k5 = se_5k, ncells_k5 = ncells,
                      k10 = fail_rate_10k, se_k10 = se_10k, ncells_k10 = ncells,
                      k100 = fail_rate_100k, se_k100 = se_100k, ncells_k100 = ncells))
  }
}
The intention is that the simulation will repeat at increasing sample sizes until the standard error for all fail rates (5k, 10k, 100k) is less than half of the fail rate, or the fail rate itself is zero (to avoid a dividing by zero scenario). Two caveats are that the simulation must run at least four times (the n<4 condition in the while loop), and stop after a maximum of six.
Now, if I run the code within the regres function in isolation (with ncycles set to 250000), I generate a nice data frame with 5 rows; I can see that n = 5, converged = TRUE, and everything else I expect to be happening within the function is just fine. If I run result <- regres(ncycles), however, it outputs a single-row data frame every time. The while loop is stopping at n = 1 despite the n < 4 condition. I cannot for the life of me figure out why the behaviour is different when the function is called compared to when the code inside it is run in isolation.
While I'm really looking to find out why this method is not working, if the method itself is completely outlandish I'm open to using a different approach entirely too.
Your return() statement is inside the while loop, so it returns the data.frame at the end of the first iteration (essentially a break before the loop condition is even checked again).
Try:
...
      converged <- TRUE
    } else {
      converged <- FALSE
    }
  }
  return(data.frame(k5 = fail_rate_5k, se_k5 = se_5k, ncells_k5 = ncells,
                    k10 = fail_rate_10k, se_k10 = se_10k, ncells_k10 = ncells,
                    k100 = fail_rate_100k, se_k100 = se_100k, ncells_k100 = ncells))
}
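A minimal illustration of why this happens (a toy function, not from the question): return() exits the whole function, not just the loop, so only the first iteration ever runs.

f <- function() {
  for (i in 1:5) {
    return(i)   # exits f() immediately, on the first pass
  }
}
f()
# [1] 1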

Split a set into n unequal subsets with the key deciding factor being that the elements in the subset aggregate and equal a predetermined amount?

I am looking at a set of numbers and aiming to split them into subsets via set partitioning. The deciding factor in how these subsets are generated is that the sum of all the elements in a subset should be as close as possible to a number generated by a pre-determined distribution. The subsets need not be the same size, and each element can only be in one subset. I had previously been given guidance on this problem via the greedy algorithm (Link here), but I have found that some of the larger numbers in the set drastically skewed the results. I would therefore like to use some form of set partitioning for this problem.
A deeper underlying issue, which I would really like to correct for future problems, is that I find I am drawn to the “brute force” approach with these types of problems (as you can see from my code below, which attempts to use folds to solve the problem via “brute force”). This is obviously a completely inefficient way to tackle the problem, so I would like to tackle these minimization-type problems with a more intelligent approach going forward. Therefore any advice is greatly appreciated.
library(groupdata2)
library(dplyr)

set.seed(345)
j <- runif(500, 0, 10000000)
dist <- c(.3, .2, .1, .05, .065, .185, .1)
s_diff <- 9999999999

for (i in 1:100) {
  x <- fold(j, k = length(dist), method = "n_rand")
  if (abs(sum(j) * dist[1] - sum(j[which(x$.folds == 1)])) < abs(s_diff)) {
    s_diff <- abs(sum(j) * dist[1] - sum(j[which(x$.folds == 1)]))
    x_fin <- x
  }
}
This is just a simplified version, only looking at the first ‘subset’. s_diff would be the smallest difference between the theoretical and actual results simulated, and x_fin would be which subset each element is in (i.e. which fold it corresponds to). I was then looking to remove the elements that fell into the first subset and continue from there, but I know my method is inefficient.
Thanks in advance!
This is not a trivial problem, as you will probably gather from the complete lack of answers at 10 days, even with a bounty. As it happens, I think it is a great problem for thinking about algorithms and optimizations, so thanks for posting.
The first thing I would note is that you are absolutely right that this is not the kind of problem with which to try brute force. You may get near to a correct answer, but with a non-trivial number of samples and distribution points, you won't find the optimum solution. You need an iterative approach that moves elements about only if they make the fit better, and the algorithm needs to stop when it can't make it any better.
My approach here is to split the problem into three stages:
1. Cut the data into approximately the correct bins as a first approximation.
2. Move elements from the bins that are a bit too big to the ones that are a bit too small. Do this iteratively until no more moves will optimize the bins.
3. Swap elements between columns to fine-tune the fit, until the swaps are optimal.
The reason to do it in this order is that each step is computationally more expensive, so you want to pass a better approximation to each step before letting it do its thing.
Let's start with a function to cut the data into approximately the correct bins:
cut_elements <- function(j, dist)
{
  # Specify the sums that we want to achieve in each partition
  partition_sizes <- dist * sum(j)

  # The cumulative partition sizes give us our initial cuts
  partitions <- cut(cumsum(j), cumsum(c(0, partition_sizes)))

  # Name our partitions according to the given distribution
  levels(partitions) <- levels(cut(seq(0, 1, 0.001), cumsum(c(0, dist))))

  # Return our partitioned data as a data frame
  data.frame(data = j, group = partitions)
}
We want a way to assess how close this approximation (and subsequent approximations) is to our answer. We can plot against the target distribution, but it will also be helpful to have a numerical figure to assess the goodness of fit to include on our plot. Here, I will use the sum of the squares of the differences between the sample bins and the target bins. We'll take the log to make the numbers more comparable. The lower the number, the better the fit.
library(dplyr)
library(ggplot2)
library(tidyr)

compare_to_distribution <- function(df, dist, title = "Comparison")
{
  df %>%
    group_by(group) %>%
    summarise(estimate = sum(data) / sum(df$data)) %>% # share of the total in each bin (use df, not the global j)
    mutate(group = factor(cumsum(dist))) %>%
    mutate(target = dist) %>%
    pivot_longer(cols = c(estimate, target)) ->
    plot_info

  log_ss <- log(sum((plot_info$value[plot_info$name == "estimate"] -
                     plot_info$value[plot_info$name == "target"])^2))

  ggplot(data = plot_info, aes(x = group, y = value, fill = name)) +
    geom_col(position = "dodge") +
    labs(title = paste(title, ": log sum of squares =", round(log_ss, 2)))
}
So now we can do:
cut_elements(j, dist) %>% compare_to_distribution(dist, title = "Cuts only")
We can see that the fit is already pretty good with a simple cut of the data, but we can do a lot better by moving appropriately sized elements from the over-sized bins to the under-sized bins. We do this iteratively until no more moves will improve our fit. We use two nested while loops, which should make us worry about computation time, but we have started with a close match, so we shouldn't get too many moves before the loop stops:
move_elements <- function(df, dist)
{
  ignore_max <- length(dist)
  while(ignore_max > 0)
  {
    ignore_min <- 1
    match_found <- FALSE
    while(ignore_min < ignore_max)
    {
      group_diffs <- sort(tapply(df$data, df$group, sum) - dist * sum(df$data))
      group_diffs <- group_diffs[ignore_min:ignore_max]
      too_big <- which.max(group_diffs)
      too_small <- which.min(group_diffs)
      swap_size <- (group_diffs[too_big] - group_diffs[too_small]) / 2
      which_big <- which(df$group == names(too_big))
      candidate_row <- which_big[which.min(abs(swap_size - df[which_big, 1]))]
      if(df$data[candidate_row] < 2 * swap_size)
      {
        df$group[candidate_row] <- names(too_small)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else
      {
        ignore_min <- ignore_min + 1
      }
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}
Let's see what that has done:
cut_elements(j, dist) %>%
  move_elements(dist) %>%
  compare_to_distribution(dist, title = "Cuts and moves")
You can see now that the match is so close we are struggling to see whether there is any difference between the target and the partitioned data. That's why we needed the numerical measure of goodness of fit (GOF).
Still, let's get this fit as good as possible by swapping elements between columns to fine-tune them. This step is computationally expensive, but again we are already giving it a close approximation, so it shouldn't have much to do:
swap_elements <- function(df, dist)
{
  ignore_max <- length(dist)
  while(ignore_max > 0)
  {
    ignore_min <- 1
    match_found <- FALSE
    while(ignore_min < ignore_max)
    {
      group_diffs <- sort(tapply(df$data, df$group, sum) - dist * sum(df$data))
      too_big <- which.max(group_diffs)
      too_small <- which.min(group_diffs)
      current_excess <- group_diffs[too_big]
      current_defic <- group_diffs[too_small]
      current_ss <- current_excess^2 + current_defic^2
      all_pairs <- expand.grid(df$data[df$group == names(too_big)],
                               df$data[df$group == names(too_small)])
      all_pairs$diff <- all_pairs[, 1] - all_pairs[, 2]
      all_pairs$resultant_big <- current_excess - all_pairs$diff
      all_pairs$resultant_small <- current_defic + all_pairs$diff
      all_pairs$sum_sq <- all_pairs$resultant_big^2 + all_pairs$resultant_small^2
      improvements <- which(all_pairs$sum_sq < current_ss)
      if(length(improvements) > 0)
      {
        swap_this <- improvements[which.min(all_pairs$sum_sq[improvements])]
        r1 <- which(df$data == all_pairs[swap_this, 1] & df$group == names(too_big))[1]
        r2 <- which(df$data == all_pairs[swap_this, 2] & df$group == names(too_small))[1]
        df$group[r1] <- names(too_small)
        df$group[r2] <- names(too_big)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else ignore_min <- ignore_min + 1
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}
Let's see what that has done:
df <- cut_elements(j, dist) %>%
  move_elements(dist) %>%
  swap_elements(dist)

df %>% compare_to_distribution(dist, title = "Cuts, moves and swaps")
Pretty close to identical. Let's quantify that:
tapply(df$data, df$group, sum) / sum(j)
#      (0,0.3]    (0.3,0.5]    (0.5,0.6]   (0.6,0.65] (0.65,0.715]  (0.715,0.9]      (0.9,1]
#   0.30000025   0.20000011   0.10000014   0.05000010   0.06499946   0.18500025   0.09999969
So, we have an exceptionally close match: each partition is less than one part in one million away from the target distribution. Quite impressive considering we only had 500 measurements to put into 7 bins.
In terms of retrieving your data, we haven't touched the ordering of j within the data frame df:
all(df$data == j)
# [1] TRUE
and the partitions are all contained within df$group. So if we want a single function to return just the partitions of j given dist, we can just do:
partition_to_distribution <- function(data, distribution)
{
  cut_elements(data, distribution) %>%
    move_elements(distribution) %>%
    swap_elements(distribution) %>%
    `[`(, 2)
}
In conclusion, we have created an algorithm that creates an exceptionally close match. However, that's no good if it takes too long to run. Let's test it out:
microbenchmark::microbenchmark(partition_to_distribution(j, dist), times = 100)
# Unit: milliseconds
#                                expr      min       lq     mean   median       uq      max neval
#  partition_to_distribution(j, dist) 47.23613 47.56924 49.95605 47.78841 52.60657 93.00016   100
Only 50 milliseconds to fit 500 samples. Seems good enough for most applications. It would grow exponentially with larger samples (about 10 seconds on my PC for 10,000 samples), but by that point the relative fineness of the samples means that cut_elements %>% move_elements already gives you a log sum of squares of below -30 and would therefore be an exceptionally good match without the fine tuning of swap_elements. These would only take about 30 ms with 10,000 samples.
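To illustrate that shortcut, here is a sketch (reusing the functions defined above; exact timing and fit will vary) that skips swap_elements for a larger sample:

set.seed(123)
j_big <- runif(10000, 0, 10000000)

cut_elements(j_big, dist) %>%
  move_elements(dist) %>%                                        # no swap step
  compare_to_distribution(dist, title = "Cuts and moves only, n = 10,000")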
To add to the excellent answer by @AllanCameron, here is a solution that utilizes the highly efficient function comboGeneral from RcppAlgos*.
library(RcppAlgos)

partDist <- function(v, d, tol_ratio = 0.0001) {
  tot_sum <- d * sum(v)
  orig_len <- length(v)
  tot_len <- d * orig_len

  df <- do.call(rbind, lapply(1L:(length(d) - 1L), function(i) {
    len <- as.integer(tot_len[i])
    vals <- comboGeneral(v, len,
                         constraintFun = "sum",
                         comparisonFun = "==",
                         limitConstraints = tot_sum[i],
                         tolerance = tol_ratio * tot_sum[i],
                         upper = 1)
    ind <- match(vals, v)
    v <<- v[-ind]
    data.frame(data = as.vector(vals), group = rep(paste0("g", i), len))
  }))

  len <- orig_len - nrow(df)
  rbind(df, data.frame(data = v,
                       group = rep(paste0("g", length(d)), len)))
}
The idea is that we find a subset of v (e.g. j in the OP's case) such that the sum is within a tolerance of sum(v) * d[i] for some index i (d is equivalent to dist in the OP's example). After we find a solution (N.B. we cap the number of solutions by setting upper = 1), we assign it to a group and remove those elements from v. We then iterate until we are left with just enough elements in v to be assigned to the last distribution value (i.e. dist[length(dist)]).
Here is an example using the OP's data:
set.seed(345)
j <- runif(500, 0, 10000000)
dist <- c(.3, .2, .1, .05, .065, .185, .1)

system.time(df_op <- partDist(j, dist, 0.0000001))
#    user  system elapsed
#   0.019   0.000   0.019
And using the plotting function by @AllanCameron, we have:
df_op %>% compare_to_distribution(dist, "RcppAlgos OP Ex")
What about a larger sample with the same distribution:
set.seed(123)
j <- runif(10000, 0, 10000000)

## N.B. Very small tolerance ratio
system.time(df_huge <- partDist(j, dist, 0.000000001))
#    user  system elapsed
#   0.070   0.000   0.071
The results:
df_huge %>% compare_to_distribution(dist, "RcppAlgos Large Ex")
As you can see, the solution scales very well. We can speed up execution by loosening tol_ratio at the expense of the quality of the result.
For reference with the large data set, the solution given by @AllanCameron takes just under 3 seconds and gives a similar log sum of squares value (~ -44):
system.time(allan_large <- partition_to_distribution(j, dist))
#    user  system elapsed
#   2.261   0.675   2.938
* I am the author of RcppAlgos

In R distance between two sentences: Word-level comparison by minimum edit distance

While trying to learn R, I want to implement the algorithm below. Consider the two lists:
List 1: "crashed", "red", "car"
List 2: "crashed", "blue", "bus"
I want to find out how many actions it would take to transform 'list1' into 'list2'.
As you can see I need only two actions:
1. Replace "red" with "blue".
2. Replace "car" with "bus".
But how can we find the number of actions like this automatically?
We can have several actions to transform the sentences: ADD, REMOVE, or REPLACE the words in the list.
Now, I will try my best to explain how the algorithm should work:
In the first step, I will create a table like this: rows i = 0, 1, 2, 3 and columns j = 0, 1, 2, 3 (for example, value[0,0] = 0, value[0,1] = 1, ...):
           crashed  red  car
        0        1    2    3
crashed 1
blue    2
bus     3
Now, I will try to fill the table. Please note that each cell in the table shows the number of actions we need to reformat the sentence (ADD, REMOVE, or REPLACE).
Consider the comparison between "crashed" and "crashed" (value[1,1]): since they are the same word, we obviously don't need to change anything, so the value is 0. Basically, we take the diagonal value, value[0,0].
           crashed  red  car
        0        1    2    3
crashed 1        0
blue    2
bus     3
Now, consider "crashed" and the second part of the sentence which is "red". Since they are not the same word we can use calculate the number of changes like this :
min{value[0,1] , value[0,2] and value[1,1]} + 1
min{ 1, 2, 0} + 1 = 1
Therefore, we need to just remove "red".
So, the table will look like this:
           crashed  red  car
        0        1    2    3
crashed 1        0    1
blue    2
bus     3
And we will continue like this:
"crashed" and "car" will be:
min{value[0,3], value[0,2], value[1,2]} + 1
min{3, 2, 1} + 1 = 2
and the table will be:
           crashed  red  car
        0        1    2    3
crashed 1        0    1    2
blue    2
bus     3
And we will continue to do so. The final result will be:
           crashed  red  car
        0        1    2    3
crashed 1        0    1    2
blue    2        1    1    2
bus     3        2    2    2
As you can see, the last number in the table shows the distance between the two sentences: value[3,3] = 2.
Basically, the algorithm should look like this:
if (characters_in_header_of_matrix[i] == characters_in_column_of_matrix[j] &
    value[i, j] == value[i+1][j-1])
then {
  get the 'DIAGONAL VALUE'   # diagonal value = value[i, j-1]
} else {
  value[i, j] = min(value[i-1, j], value[i-1, j-1], value[i, j-1]) + 1
}
endif
For finding the difference between the elements of the two lists that you can see in the header and the first column of the matrix, I have used the strcmp() function, which gives a boolean value (TRUE or FALSE) when comparing the words. But I am failing at implementing this.
I'd appreciate your help on this one, thanks.
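For reference, the word-level recurrence described in the question can be implemented directly in R. This is a minimal sketch of the standard dynamic-programming approach (separate from the answers that follow; the function name is illustrative):

word_edit_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- 0:n                              # cost of deleting i words
  d[1, ] <- 0:m                              # cost of inserting j words
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else 1     # 0 if the words match, 1 for a replace
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   # remove a word
                             d[i + 1, j] + 1,   # add a word
                             d[i, j] + cost)    # replace (or keep) a word
    }
  }
  d[n + 1, m + 1]
}

word_edit_distance(c("crashed", "red", "car"), c("crashed", "blue", "bus"))
# [1] 2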
The question
After some clarification in a previous post, and after the update of the post, my understanding is that Zero is asking: 'how one can iteratively count the number of word differences in two strings'.
I am unaware of any existing implementation in R, although I would be surprised if one doesn't already exist. I took a bit of time out to create a simple implementation, altering the algorithm slightly for simplicity (for anyone not interested, scroll down for the two implementations: one in pure R, one using the smallest amount of Rcpp). The general idea of the implementation:
1. Initialize with string_1 and string_2 of lengths n_1 and n_2.
2. Calculate the cumulative difference between the first min(n_1, n_2) elements.
3. Use this cumulative difference as the diagonal in the matrix.
4. Set the first off-diagonal elements to the very first element + 1.
5. Calculate the remaining off-diagonal elements as: diag(i) - diag(i - 1) + full_matrix(i - 1, j).
6. In the previous step, i iterates over diagonals and j iterates over rows/columns (either one works); we start in the third diagonal, as the first 2x2 matrix is filled in steps 1 to 4.
7. Calculate the remaining abs(n_1 - n_2) elements as full_matrix[, min(n_1, n_2)] + 1:abs(n_1 - n_2), applying the latter over each value in the former, and bind them appropriately to the full_matrix.
The output is a matrix with row and column dimension names taken from the corresponding strings, which has been formatted for easier reading.
Implementation in R
library(stringr)

Dist_between_strings <- function(x, y,
                                 split = " ",
                                 split_x = split, split_y = split,
                                 case_sensitive = TRUE){
  # Safety checks
  if(!is.character(x) || !is.character(y) ||
     nchar(x) == 0 || nchar(y) == 0)
    stop("x, y need to be non-empty character strings.")
  if(length(x) != 1 || length(y) != 1)
    stop("Currently the function is not vectorized, please provide the strings individually or use lapply.")
  if(!is.logical(case_sensitive))
    stop("case_sensitive needs to be logical")
  # Extract the variable names of our variables,
  #  used for the dimension names later on
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  # Expression which, when evaluated, will name our output
  dimname_expression <-
    parse(text = paste0("dimnames(output) <- list(",
                        make.names(x_name, unique = TRUE), " = x_names,",
                        make.names(y_name, unique = TRUE), " = y_names)"))
  # Split the strings into words
  x_names <- str_split(x, split_x, simplify = TRUE)
  y_names <- str_split(y, split_y, simplify = TRUE)
  # Are we case sensitive? If not, compare lower-cased words
  # (note: the original post had this condition inverted)
  if(!isTRUE(case_sensitive)){
    x_split <- str_split(tolower(x), split_x, simplify = TRUE)
    y_split <- str_split(tolower(y), split_y, simplify = TRUE)
  }else{
    x_split <- x_names
    y_split <- y_names
  }
  # Create an index in case the two are of different length
  idx <- seq(1, (n_min <- min((nx <- length(x_split)),
                              (ny <- length(y_split)))))
  n_max <- max(nx, ny)
  # If we have one string that has length 1, the output is simplified
  if(n_min == 1){
    distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
    output <- matrix(distances, nrow = nx)
    eval(dimname_expression)
    return(output)
  }
  # If not, we will have to do a bit of work
  output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
  # The loop will fill in the off-diagonal
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1
  if(n_max > 2)
    for(i in 3:n_min){
      for(j in 1:(i - 1)){
        output[i, j] <- output[j, i] <- output[i, i] - output[i - 1, i - 1] + # are the words different?
          output[i - 1, j]                                                    # how many words were different before?
      }
    }
  # Comparison if the lists are not of the same size
  if(nx != ny){
    # Add the remaining words to the side that does not contain them
    additional_words <- seq(1, n_max - n_min)
    additional_words <- sapply(additional_words, function(x) x + output[, n_min])
    # Merge the additional words
    if(nx > ny)
      output <- rbind(output, t(additional_words))
    else
      output <- cbind(output, additional_words)
  }
  # Set the dimension names;
  #  I would like the original variable names to be displayed, so I create an expression and evaluate it
  eval(dimname_expression)
  output
}
Note that the implementation is not vectorized, and as such can only take single string inputs!
Testing the implementation
To test the implementation, one could use the strings given in the question. As they were said to be contained in lists, we will have to convert them to strings first. Note that the function lets one split each string differently; however, it assumes space-separated strings by default. So first I'll show how one could achieve a conversion to the correct format:
list_1 <- list("crashed", "red", "car")
list_2 <- list("crashed", "blue", "bus")
string_1 <- paste(list_1, collapse = " ")
string_2 <- paste(list_2, collapse = " ")
Dist_between_strings(string_1, string_2)
# Output (strings in the given example):
#          string_2
# string_1  crashed blue bus
#   crashed       0    1   2
#   red           1    1   2
#   car           2    2   2
This is not exactly the output from the question, but it conveys the same information, as the words are ordered as they were given in the strings.
More examples
Now, I stated it works for other strings as well, and this is indeed the case, so let's try some random user-made strings:
#More complicated strings
string_3 <- "I am not a blue whale"
string_4 <- "I am a cat"
string_5 <- "I am a beautiful flower power girl with monster wings"
string_6 <- "Hello"
Dist_between_strings(string_3, string_4, case_sensitive = TRUE)
Dist_between_strings(string_3, string_5, case_sensitive = TRUE)
Dist_between_strings(string_4, string_5, case_sensitive = TRUE)
Dist_between_strings(string_6, string_5)
Running these shows that they do yield the correct answers. Note that if either string is of length 1, the comparison is a lot faster.
Benchmarking the implementation
Now that the implementation is accepted as correct, we would like to know how well it performs (the uninterested reader can scroll past this section to where a faster implementation is given). For this purpose, I will use much larger strings. For a complete benchmark I should test various string sizes, but here I will only use two rather large strings of 1000 and 2500 words. I use the microbenchmark package in R, which contains a microbenchmark function that claims to be accurate down to nanoseconds. The function executes the code 100 times (or a user-defined number of times), returning the mean and quartiles of the run times. Due to other parts of R, such as the garbage collector, the median is usually considered a good estimate of the actual average run time of the function.
The execution and results are shown below:
# Benchmarks for larger strings
set.seed(1)
string_7 <- paste(sample(LETTERS, 1000, replace = TRUE), collapse = " ")
string_8 <- paste(sample(LETTERS, 2500, replace = TRUE), collapse = " ")
microbenchmark::microbenchmark(String_Comparison = Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
# String_Comparison 716.5703 729.4458 816.1161 763.5452 888.1231 1106.959   100
Profiling
Now, I find these run times very slow. One use case for the implementation could be an initial check of student hand-ins for plagiarism, in which case a low difference count very likely indicates plagiarism. Hand-ins can be very long, and there may be hundreds of them, so I would like the run to be very fast.
To figure out how to improve my implementation, I used the profvis package with the corresponding profvis function. To profile the function, I exported it to another R script that I sourced, running the code once prior to profiling to compile it and avoid profiling noise (important). The code to run the profiling can be seen below; the most important part of the output was a profvis flame graph (image not reproduced here).
library(profvis)
profvis(Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
Despite the colouring, I can see a clear problem here: the loop filling in the off-diagonal is by far responsible for most of the runtime. Loops in R (as in Python and other interpreted languages) are notoriously slow.
Using Rcpp to improve performance
To improve the implementation, we can implement the loop in C++ using the Rcpp package. This is rather simple; the code is not unlike what we would write in R if we avoid iterators. A C++ script can be made via File -> New File -> C++ File. The following C++ code would be pasted into the corresponding file and sourced using the source button.
//Rcpp Code
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix Cpp_String_difference_outer_diag(NumericMatrix output){
  long nrow = output.nrow();
  for(long i = 2; i < nrow; i++){ // note: start at the third row; the first 2x2 block is already filled
    for(long j = 0; j < i; j++){
      output(i, j) = output(i, i) - output(i - 1, i - 1) + // are the words different?
        output(i - 1, j);
      output(j, i) = output(i, j);
    }
  }
  return output;
}
The corresponding R function needs to be altered to use this function instead of looping. The code is similar to the first function, only switching the loop for a call to the c++ function.
Dist_between_strings_cpp <- function(x, y,
                                     split = " ",
                                     split_x = split, split_y = split,
                                     case_sensitive = TRUE){
  # Safety checks
  if(!is.character(x) || !is.character(y) ||
     nchar(x) == 0 || nchar(y) == 0)
    stop("x, y need to be non-empty character strings.")
  if(length(x) != 1 || length(y) != 1)
    stop("Currently the function is not vectorized, please provide the strings individually or use lapply.")
  if(!is.logical(case_sensitive))
    stop("case_sensitive needs to be logical")
  # Extract the variable names of our variables,
  #  used for the dimension names later on
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  # Expression which, when evaluated, will name our output
  dimname_expression <-
    parse(text = paste0("dimnames(output) <- list(",
                        make.names(x_name, unique = TRUE), " = x_names,",
                        make.names(y_name, unique = TRUE), " = y_names)"))
  # Split the strings into words
  x_names <- str_split(x, split_x, simplify = TRUE)
  y_names <- str_split(y, split_y, simplify = TRUE)
  # Are we case sensitive? If not, compare lower-cased words
  # (note: the original post had this condition inverted)
  if(!isTRUE(case_sensitive)){
    x_split <- str_split(tolower(x), split_x, simplify = TRUE)
    y_split <- str_split(tolower(y), split_y, simplify = TRUE)
  }else{
    x_split <- x_names
    y_split <- y_names
  }
  # Create an index in case the two are of different length
  idx <- seq(1, (n_min <- min((nx <- length(x_split)),
                              (ny <- length(y_split)))))
  n_max <- max(nx, ny)
  # If we have one string that has length 1, the output is simplified
  if(n_min == 1){
    distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
    output <- matrix(distances, nrow = nx)
    eval(dimname_expression)
    return(output)
  }
  # If not, we will have to do a bit of work
  output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
  # The C++ function will fill in the off-diagonal
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1
  if(n_max > 2)
    output <- Cpp_String_difference_outer_diag(output) # Execute the C++ code
  # Comparison if the lists are not of the same size
  if(nx != ny){
    # Add the remaining words to the side that does not contain them
    additional_words <- seq(1, n_max - n_min)
    additional_words <- sapply(additional_words, function(x) x + output[, n_min])
    # Merge the additional words
    if(nx > ny)
      output <- rbind(output, t(additional_words))
    else
      output <- cbind(output, additional_words)
  }
  # Set the dimension names;
  #  I would like the original variable names to be displayed, so I create an expression and evaluate it
  eval(dimname_expression)
  output
}
Testing the C++ implementation

To be sure the implementation is correct, we check whether the same output is obtained with the C++ implementation.
# Test the cpp implementation
identical(Dist_between_strings(string_3, string_4, case_sensitive = TRUE),
          Dist_between_strings_cpp(string_3, string_4, case_sensitive = TRUE))
# TRUE
Final benchmarks
Now is this actually faster? To see this we could run another benchmark using the microbenchmark package. The code and results are shown below:
# Final microbenchmarking
microbenchmark::microbenchmark(
  R = Dist_between_strings(string_7, string_8, case_sensitive = FALSE),
  Rcpp = Dist_between_strings_cpp(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
#  expr       min       lq      mean    median        uq       max neval
#     R 721.71899 753.6992 850.21045 787.26555 907.06919 1756.7574   100
#  Rcpp  23.90164  32.9145  54.37215  37.28216  47.88256  243.6572   100
The microbenchmark shows a median improvement factor of roughly 21 (= 787 / 37), which is a massive improvement from moving just a single loop to C++!
There is already an edit-distance function in R we can take advantage of: adist().
As it works on the character level, we'll have to assign a character to each unique word in our sentences, and stitch them together to form pseudo-words we can calculate the distance between.
s1 <- c("crashed", "red", "car")
s2 <- c("crashed", "blue", "bus")
ll <- list(s1, s2)

alnum <- c(letters, LETTERS, 0:9)
ll2 <- relist(alnum[factor(unlist(ll))], ll)
ll2 <- sapply(ll2, paste, collapse = "")

adist(ll2)
#      [,1] [,2]
# [1,]    0    2
# [2,]    2    0
The main limitation here, as far as I can tell, is the number of unique characters available, which in this case is 62, but it can be extended quite easily, depending on your locale, e.g. intToUtf8(c(32:126, 161:300), TRUE).
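For example, a larger symbol alphabet could be built and swapped in like this (a sketch reusing ll from above):

big_alphabet <- intToUtf8(c(32:126, 161:300), multiple = TRUE)  # one single-character string per code point
length(big_alphabet)
# [1] 235

ll2 <- relist(big_alphabet[factor(unlist(ll))], ll)
ll2 <- sapply(ll2, paste, collapse = "")
adist(ll2)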

RStudio not responding with large Linear Optimization

I have a large two-step optimization problem that I've tried to simplify for this question. The first step is to choose 10 elements to maximize utility with certain constraints. I need 200 of these sets, but due to the nature of what I'm trying to do, there needs to be 600 generated so that the correct combinations can manifest.
Wrapping around these mini-optimization problems is a larger constraint where each individual element can only be used within a certain range. The first optimization tunes each element's utility so that each one is relatively close to the bounds, but it's not possible for all of them to be within their bounds. Therefore, the second step is to choose 200 of the 600 sets such that each individual element's min/max usage is satisfied. This is what I need help with.
I made a function using lpSolve that works, but over 80% of the time it freezes RStudio, and it's just becoming too much of a hassle; I need to either improve my current approach or find a completely new one. I don't know if lpSolve is really the best approach to begin with. While I do have an overall set score that I can maximize, all I really care about is having each element within its bounds. I've made a simplified example to get at the essence of my problem.
I'm in charge of making 200 meals from a set of 80 different fruits. Each meal uses 10 fruits and cannot contain more than 1 of the same fruit. I'm limited in the number of each fruit that I have (and my boss is making me use a minimum of each fruit, otherwise they'll go bad), so each fruit needs to be used within certain bounds. I have a list of 600 meals already created (Meals), and each one has its own unique Health-Score. Ideally I would like to maximize the Health-Score, but obviously the most important piece is that each fruit is used the correct number of times; otherwise the meals can't be made in the first place.
Here's my code to 1) set up the 600 Meals (random), 2) set the min/max number of times each fruit must be used (random), and 3) run a linear optimization to select 200 of the 600 meals such that the individual fruit constraints are fulfilled. The program tries to choose 200 of the 600, but if the constraints don't allow it, it loosens them (e.g. if the solver doesn't work the first time, I decrease the minimum number of times an Apple can be used and increase the maximum number of times it can be used). It does this one fruit at a time rather than all at once. Eventually the constraints should be loosened so much that any 200 of the 600 will work (i.e. when every fruit's minPercent is less than 0 and every fruit's maxPercent is greater than 100), but it doesn't matter because R freezes up.
library(stringr)
library(dplyr)
library(lpSolve)

# Inputs
MealsNeeded <- 200
Buffer <- 3

# Setup the meals (this is the output of another optimizer in my actual program.
# Considered "Step 1" as I mentioned above)
Meals <- data.frame()
for(i in 1:(MealsNeeded*Buffer)){
  run <- i
  meal <- sample(fruit, 10)
  healthFactor <- round(runif(1, 10, 30), 0) # (Health factor for the entire meal)
  df <- data.frame(Run = run, Fruit = meal, healthFactor = healthFactor, stringsAsFactors = FALSE)
  Meals <- rbind(Meals, df)
}

# The minimum/maximum number of times each fruit must be used across all 200 meals
# (these would be inputs in my program)
set.seed(11)
fruitDF <- data.frame(Name = fruit, minSelectPct = round(runif(length(fruit), .05, .1)*100, 0), stringsAsFactors = FALSE) %>%
  mutate(maxSelectPct = round(minSelectPct/2 + runif(length(fruit), .05, .1)*100, 0))

#### Actual Program Start

# Get objective
obj <- Meals %>%
  distinct(Run, healthFactor) %>%
  ungroup() %>%
  select(healthFactor) %>%
  pull()

# Dummy LU - for each fruit give 1/0 whether or not they were in the meal
dummyLUInd <- data.frame(FruitName = fruitDF$Name, stringsAsFactors = FALSE)
for(i in unique(Meals$Run)){
  selectedFruit <- Meals %>%
    filter(Run == i) %>%
    select(Fruit) %>%
    mutate(Indicator = 1)
  dummyLUIndTemp <- fruitDF %>%
    left_join(selectedFruit, by = c('Name' = 'Fruit')) %>%
    mutate(Indicator = ifelse(is.na(Indicator), 0, Indicator)) %>%
    select(Indicator)
  dummyLUInd <- cbind(dummyLUInd, dummyLUIndTemp)
}

## Table create
dummyLUInd <- rbind(dummyLUInd, dummyLUInd)[, -1]
dummyLUInd <- as.data.frame(t(dummyLUInd))
dummyLUInd$Total = 1

## Directions
dirLT <- c(rep('<=', (ncol(dummyLUInd) - 1)/2))
dirGT <- c(rep('>=', (ncol(dummyLUInd) - 1)/2))

## Multiply percentages by total Meals
MinExp = round(fruitDF$minSelectPct/100 * MealsNeeded - 0.499, 0)
MaxExp = round(fruitDF$maxSelectPct/100 * MealsNeeded + 0.499, 0)

# Setup constraints like # of tries
CounterMax <- 10000
LPSum = 0
Counter = 0

# Create DF to make it easier to change constraints for each run
MinExpDF <- data.frame(Place = 1:length(MinExp), MinExp = MinExp)
MaxExpDF <- data.frame(Place = 1:length(MaxExp), MaxExp = MaxExp)

cat('\nStarting\n')
Sys.sleep(2)

# Try to get the 200 of 600 Meals that satisfy the constraints for the individual Fruit.
# If the solution doesn't exist, loosen the constraints for each fruit (one at a time) until it does work
while (LPSum == 0 & Counter <= CounterMax) {
  rowUse <- Counter %% length(MaxExp)

  # Knock one off the minimum, starting with highest exposure, one at a time
  MinExpDF <- MinExpDF %>%
    mutate(Rank = rank(-MinExp, na.last = FALSE, ties.method = "first"),
           MinExp = ifelse(Rank == rowUse, MinExp - 1, MinExp))
  MinExp <- MinExpDF$MinExp

  # Add one to the maximum, starting with highest exposure, one at a time
  MaxExpDF <- MaxExpDF %>%
    mutate(Rank = rank(-MaxExp, na.last = FALSE, ties.method = "first"),
           MaxExp = ifelse(Rank == rowUse, MaxExp + 1, MaxExp))
  MaxExp <- MaxExpDF$MaxExp

  # Solve
  dir <- 'max'
  f.obj <- obj
  f.mat <- t(dummyLUInd)
  f.dir <- c(dirGT, dirLT, '==')
  f.rhs <- c(MinExp, MaxExp, MealsNeeded)
  Sol <- lp(dir, f.obj, f.mat, f.dir, f.rhs, all.bin = T)$solution
  LPSum <- sum(Sol)

  Counter = Counter + 1
  if(Counter %% 10 == 0) cat(Counter, ', ', sep = '')
}
# Get the Run #'s from the lpSolve
if(Counter >= CounterMax){
  cat("Unable to find right exposure, returning all Meals\n")
  MealsSolved <- Meals
} else {
  MealsSolved <- data.frame(Run = unique(Meals$Run))
  MealsSolved$selected <- Sol
  MealsSolved <- MealsSolved[MealsSolved$selected == 1, ]
}

# Final Meals
FinalMeals <- Meals %>%
  filter(Run %in% MealsSolved$Run)
If you run this code enough times, eventually RStudio will freeze up on you (at least it does for me; if it doesn't for you, I suppose increase the number of Meals). It happens during the actual lp() call, so there's really not much you can do since it's really C code. This is where I'm lost.
A part of me thinks this isn't really an lpSolve issue, since I'm not really trying to maximize anything (the Health-Factor isn't all that important). My real "loss function" is the number of times each fruit goes above/below its min/max exposure, but I can't think of how to set something like this up. Can my current approach work, or do I need to do something completely different?
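One possible direction (a sketch only, on toy data with made-up names, not the objects from the question): treat the min/max usage bounds as soft constraints by adding non-negative slack variables that measure how far a selection falls outside each fruit's bounds, and ask lpSolve to minimise the total slack instead of looping and loosening the bounds by hand. A total slack of zero means every bound was met.

library(lpSolve)

set.seed(1)
nMeals <- 30                                      # candidate meals (toy size)
nFruit <- 8
pick   <- 10                                      # meals to select
A      <- matrix(rbinom(nFruit * nMeals, 1, 0.4),
                 nrow = nFruit)                   # fruit-in-meal indicator matrix
minUse <- rep(3, nFruit)
maxUse <- rep(8, nFruit)

nSlack <- 2 * nFruit                              # one "under" and one "over" slack per fruit
obj    <- c(rep(0, nMeals), rep(1, nSlack))       # minimise total bound violation only

underMat <- cbind(A, diag(nFruit), matrix(0, nFruit, nFruit))  # A %*% x + s_under >= minUse
overMat  <- cbind(A, matrix(0, nFruit, nFruit), -diag(nFruit)) # A %*% x - s_over  <= maxUse
pickRow  <- c(rep(1, nMeals), rep(0, nSlack))                  # sum(x) == pick

constr <- rbind(underMat, overMat, pickRow)
dirs   <- c(rep(">=", nFruit), rep("<=", nFruit), "==")
rhs    <- c(minUse, maxUse, pick)

sol <- lp("min", obj, constr, dirs, rhs,
          binary.vec = 1:nMeals)                  # meal variables are 0/1; slacks stay continuous
sol$objval                                        # total violation of the bounds (0 = all bounds met)
selected <- which(sol$solution[1:nMeals] > 0.5)   # indices of the chosen meals

The same idea would scale to the real problem by using the fruit-by-meal indicator matrix in place of A and MealsNeeded in place of pick; a weighted health score could also be added to the objective to break ties between zero-violation solutions.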

R - Dealing with zeros in randomized subsamples

I've run into a little problem simulating the throw of dice. Basically, I'm doing this to get familiar with loops and their output.
The intention is to simulate the throw of two dice as follows:
R = 100
d6 = c(1:6)
d = 60
DICE = NULL

for (i in 1:R)
{
  i <- as.factor((sample(d6, size = d, replace = T)) + (sample(d6, size = d, replace = T)))
  j <- summary(i)
  DICE = rbind(DICE, j)
}

head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main = "Result 2d6", ylab = "Throws", xlab = "")

relHIS = (HIS / sum(HIS)) * 100
relHIS
Problems occur if the count in one category is 0 (the result did not occur in the sample). If this happens randomly in the first subsample, one or more of the categories (numbers 2-12) are missing, which causes problems ("number of columns of result is not a multiple of vector length (arg 2)") in the following subsamples.
I'm sure there is a really simple solution for this, by defining everything beforehand...
Thanks for your help!
Here are some fixes:
R = 100
d6 = c(1:6)
d = 60
DICE = matrix(nrow = R, ncol = 11)  # pre-allocate
colnames(DICE) <- 2:12

for (i in 1:R)
{
  sim <- ordered((sample(d6, size = d, replace = T)) + (sample(d6, size = d, replace = T)),
                 levels = 2:12)     # define the factor levels
  sumsim <- table(sim)
  DICE[i, ] <- sumsim               # sub-assign
}

head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main = "Result 2d6", ylab = "Throws", xlab = "")

prop.table(HIS) * 100
Always pre-allocate your result data structure. Growing it in a loop is terribly slow and you know how big it needs to be. Also, don't use the same symbol for the iteration variable and something else.
Omit as.factor() in your seventh row.
