I've constructed a data.frame using the inefficient code below. Can you improve it? If you can think of a better starting point, please include that as well.
My code takes data from the first two data.frames and combines them to give the third. The first data.frame is a grid of 1s and -1s representing high or low values. The second data.frame includes all the information needed to calculate the high or low values. Note that each column involves a similar calculation, but the calculation may differ from column to column.
## Example Data for Question
## How to prepare a 3rd data.frame from two others
## Prepare 1st data.frame, a Yates table
A <- B <- C <- c(1,-1)
yates.1 <- expand.grid(A = A, B = B, C = C)
## Prepare 2nd data.frame with the reaction data
reaction.info <- data.frame(stringsAsFactors = FALSE,
factor.name = c("A", "B", "C"),
Component = c("Water", "SM", "Reagent"),
Mw = c(18, 36.5, 40),
centre.point = c(20, 1.4, 1.45),
positive.point = c(22, 1.54, 1.595),
negative.point = c(18, 1.26, 1.305))
## Prepare 3rd data.frame to be filled
reaction.quants <- as.data.frame(matrix(NA, dim(yates.1)[1], dim(yates.1)[2]))
names(reaction.quants) <- reaction.info[,2]
reaction.quants[yates.1[,1] == 1 ,1] <- round(5 * reaction.info[1,3] * reaction.info[1,5], 3)
reaction.quants[yates.1[,1] == -1 ,1] <- round(5 * reaction.info[1,3] * reaction.info[1,6], 3)
reaction.quants[yates.1[,2] == 1 ,2] <- round(5 * reaction.info[2,3] * reaction.info[2,5], 3)
reaction.quants[yates.1[,2] == -1 ,2] <- round(5 * reaction.info[2,3] * reaction.info[2,6], 3)
reaction.quants[yates.1[,3] == 1 ,3] <- round(5 * reaction.info[3,3] * reaction.info[3,5], 3)
reaction.quants[yates.1[,3] == -1 ,3] <- round(5 * reaction.info[3,3] * reaction.info[3,6], 3)
## three data.frames
yates.1
reaction.info
reaction.quants
Consider refactoring with ifelse logic and filtered vectors inside a user-defined function, since the logic is very similar across the different columns:
convert_col <- function(nm) {
  # reads the +/-1 grid column named nm from reaction.quants_new in the calling environment
  with(reaction.info,
       ifelse(reaction.quants_new[[nm]] == 1,
              round(5 * Mw[Component == nm] * positive.point[Component == nm], 3),
              round(5 * Mw[Component == nm] * negative.point[Component == nm], 3)
       )
  )
}
# INITIALIZE DATA FRAME (CAN BE NESTED IN NEXT CALL)
reaction.quants_new <- setNames(data.frame(yates.1), reaction.info$Component)
# ADD COLUMNS (CAN ALSO USE TRANSFORM)
reaction.quants_new <- within(reaction.quants_new, {
Water <- convert_col("Water")
SM <- convert_col("SM")
Reagent <- convert_col("Reagent")
})
reaction.quants_new
# Water SM Reagent
# 1 1980 281.05 319
# 2 1620 281.05 319
# 3 1980 229.95 319
# 4 1620 229.95 319
# 5 1980 281.05 261
# 6 1620 281.05 261
# 7 1980 229.95 261
# 8 1620 229.95 261
identical(reaction.quants, reaction.quants_new)
# [1] TRUE
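The comment above notes that transform() can be used in place of within(); a minimal sketch of that variant, starting again from the +/-1 grid:
reaction.quants_new <- setNames(data.frame(yates.1), reaction.info$Component)
reaction.quants_new <- transform(reaction.quants_new,
                                 Water = convert_col("Water"),
                                 SM = convert_col("SM"),
                                 Reagent = convert_col("Reagent"))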
And with the native pipe available in R 4.1+, still using the user-defined function:
reaction.quants_new <- data.frame(yates.1) |>
setNames(reaction.info$Component) |>
within({
Water <- convert_col("Water")
SM <- convert_col("SM")
Reagent <- convert_col("Reagent")
})
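To avoid spelling out each column name at all, one further sketch (not from the original answer, same inputs as above) builds all columns in a single pass over the rows of reaction.info:
reaction.quants_alt <- as.data.frame(sapply(seq_len(nrow(reaction.info)), function(k)
  round(5 * reaction.info$Mw[k] *
          ifelse(yates.1[[k]] == 1,
                 reaction.info$positive.point[k],
                 reaction.info$negative.point[k]), 3)))
names(reaction.quants_alt) <- reaction.info$Component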
I want to replace the values of a column in a dataframe that contains only 4 distinct numbers with specific numbers, as shown below
tt <- rep(c(1,2,3,4), each = 10)
df <- data.frame(tt)
I want to replace 1 with 10, 2 with 200, 3 with 458, and 4 with -0.1.
You could use recode from dplyr. Note that the old values are written as character strings, while the result is a numeric (double) column since -0.1 is not an integer. The output below is shown for the 8-row df given in the Note at the end:
library(tidyverse)
df %>%
mutate(tt = recode(tt, '1'= 10, '2' = 200, '3' = 458, '4' = -0.1))
tt
1 10.0
2 10.0
3 200.0
4 200.0
5 458.0
6 458.0
7 -0.1
8 -0.1
To correct the error in the code in the question and to keep the example short, we use the input in the Note at the end. Here are several alternatives; nos, defined in (1), is also used in some of the others. No packages are used.
1) indexing Since the values of tt are in 1:4 we can use them directly to index nos. This is probably the simplest solution.
nos <- c(10, 200, 458, -0.1)
transform(df, tt = nos[tt])
## tt
## 1 10.0
## 2 10.0
## 3 200.0
## 4 200.0
## 5 458.0
## 6 458.0
## 7 -0.1
## 8 -0.1
1a) If the input is not necessarily in 1:4, then we could use this generalization:
transform(df, tt = nos[match(tt, 1:4)])
2) arithmetic Another approach is to use arithmetic:
transform(df, tt = 10 * (tt == 1) +
200 * (tt == 2) +
458 * (tt == 3) +
-0.1 * (tt == 4))
3) outer/matrix multiplication This would also work:
transform(df, tt = c(outer(tt, 1:4, `==`) %*% nos))
3a) This is the same except we use model.matrix instead of outer.
transform(df, tt = c(model.matrix(~ factor(tt) + 0, df) %*% nos))
4) factor The levels of the factor are 1:4 and the corresponding labels are defined by nos. Extract the labels using format and then convert them to numeric.
transform(df, tt = as.numeric(format(factor(tt, levels = 1:4, labels = nos))))
4a) or as a pipeline
transform(df, tt = tt |>
factor(levels = 1:4, labels = nos) |>
format() |>
as.numeric())
5) loop We can use a simple loop. Nulling out i at the end is so that it is not made into a column.
within(df, { for(i in 1:4) tt[tt == i] <- nos[i]; i <- NULL })
6) Reduce This is somewhat similar to (5) but implements the loop using Reduce.
fun <- function(tt, i) replace(tt, tt == i, nos[i])
transform(df, tt = Reduce(fun, init = tt, 1:4))
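One more possibility along the same lines (not part of the original list): a named vector used as a lookup table.
lookup <- setNames(nos, 1:4)
transform(df, tt = unname(lookup[as.character(tt)]))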
Note
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))
I am running a Monte Carlo simulation of a multinomial logit, so I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values, in particular varying both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far I have managed to solve it, but at some point I ended up with a nested for-loop structure over a grid search of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual (n.choices_list). I am worried about the efficiency of the last bit of code, with its double for-loop running over the combinations of the possible values. Probably there is a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I ended up with multiple objects that contain the models from the combinations of the grid search, with informative names, but it would be great if I could collapse all of them into a list; with the current structure I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}
You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
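As an aside, the same pairing can be generated with expand.grid, which some may find clearer; grid$n.indiv reproduces v1 and grid$n.choices reproduces v2:
grid <- expand.grid(n.choices = n.choices_list, n.indiv = n.indiv_list)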
result <- map2(v1, v2, function(n1, n2) {
  n.indiv   <<- n1  # mlogit_sim_data() reads these from the global environment
  n.choices <<- n2
  lapply(X = 1:nreps, FUN = mlogit_sim_data)
})
names(result) <- paste0("m_ind_", v1, "_choices_", v2)
result will be a named list of your function outputs, one element per (n.indiv, n.choices) pair. Note that assign() inside the anonymous function would only bind the name in that function's local environment, which is why the informative names are attached with names() afterwards.
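For example, the nreps fitted models for n.indiv = 100 and n.choices = 4 can then be retrieved by name:
result[["m_ind_100_choices_4"]]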
I have a dataframe which includes 2 columns, let's say "left" and "right", which define intervals. I want to test if a given numeric "x" is part of any interval defined by the dataframe (if it is, it should be only once, those intervals don't overlap). Expected behaviour:
> df <- data.frame(id = c("A", "B", "C"), left = c(0, 50, 150), right = c(15, 78, 190))
> df
id left right
1 A 0 15
2 B 50 78
3 C 150 190
> my_function(7)
TRUE
> my_function(20)
FALSE
So I did it this way, but it's terribly slow and I'm pretty sure this could be optimized:
my_function <- function(x) {
test <- df %>% dplyr::rowwise() %>% dplyr::mutate(test = (x >= left) && (x <= right)) %>% ungroup()
test <- test %>% filter(test == T)
nrow(test) == 1
}
Then I'd be interested in getting the matching row in case the output is TRUE, but with the current function it'll take forever (the actual dataframe has ~5,000 rows, and I want to test/get coordinates for thousands of x values).
I found a library that manages interval objects, but it seems tailored for time intervals. Any suggestion?
Here is a simple way with an example:
z <- 567 # single dummy value
left <- seq(100, 900, 200)
right <- seq(200, 1000, 200)
df <- data.frame(left, right) # dummy intervals
lo <- z >= df$left
hi <- z <= df$right
check <- lo * hi
introw <- which(check == 1)
introw
3
z2 <- c(356, 934, 134, 597, 771) # vector of values to check
lo2 <- sapply(z2, function(x) x >= df$left)
hi2 <- sapply(z2, function(x) x <= df$right)
check2 <- lo2 * hi2
introws <- apply(check2, 2, function(x) which(x ==1))
introws #vector of intervals for each input value
2 5 1 3 4
final <- cbind(value = z2, interval = introws)
final
value interval
[1,] 356 2
[2,] 934 5
[3,] 134 1
[4,] 597 3
[5,] 771 4
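If the intervals are kept sorted by left (as in this dummy df) and do not overlap, findInterval() scales better for thousands of query values; a sketch (not from the original answer, assuming that sorting):
locate <- function(x, df) {
  cand <- findInterval(x, df$left)           # candidate interval index per value
  hit  <- cand >= 1 & x <= df$right[pmax(cand, 1)]
  ifelse(hit, cand, NA_integer_)             # NA where x falls in no interval
}
locate(z2, df)  # 2 5 1 3 4, matching introws above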
Try this approach using between():
#Code
my_function <- function(x) {
test <- df %>% dplyr::rowwise() %>%
dplyr::mutate(test = between(x,left,right)) %>% ungroup()
test <- test %>% filter(test == T)
nrow(test) == 1
}
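Note that between() is itself vectorized, so the rowwise() step can be dropped entirely; a plain base-R sketch that also returns the matching row, using the question's df:
match_interval <- function(x, df) {
  df[x >= df$left & x <= df$right, , drop = FALSE]  # zero rows if no interval contains x
}
match_interval(7, df)               # the row for id "A"
nrow(match_interval(20, df)) == 1   # FALSE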
I have a matrix of 1s and 0s where the rows are individuals and the columns are events. A 1 indicates that an event happened to an individual and a 0 that it did not.
I want to find which set of (in the example) 5 columns/events that cover the most rows/individuals.
Test Data
#Make test data
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
My attempt
My initial attempt was just based on combining the set of 5 columns with the highest colMeans:
#Get top 5 columns with highest row coverage
col_set <- head(sort(colMeans(d), decreasing = T), 5)
#Have a look the set
col_set
>
197 199 59 80 76
0.2666667 0.2666667 0.2333333 0.2333333 0.2000000
#Check row coverage of the column set
sum(apply(d[,colnames(d) %in% names(col_set)], 1, sum) > 0) / 30 #top 5
>
[1] 0.7
However, this set does not cover the most rows. I tested this by pseudo-randomly sampling 10,000 different sets of 5 columns, and then finding the set with the highest coverage:
#Get 5 random columns using colMeans as prob in sample
##Random sample 10,000 times
set.seed(123)
result <- lapply(1:10000, function(x){
col_set2 <- sample(colMeans(d), 5, F, colMeans(d))
cover <- sum(apply(d[,colnames(d) %in% names(col_set2)], 1, sum) > 0) / 30 #random 5
list(set = col_set2, cover = cover)
})
##Have a look at the best set
result[which.max(sapply(result, function(x) x[["cover"]]))]
>
[[1]]
[[1]]$set
59 169 262 68 197
0.23333333 0.10000000 0.06666667 0.16666667 0.26666667
[[1]]$cover
[1] 0.7666667
The reason for supplying the colMeans to sample is that the columns with the highest coverages are the ones I am most interested in.
So, using pseudo-random sampling I can collect a set of columns with higher coverage than when just using the top 5 columns. However, since my actual data sets are larger than the example I am looking for a more efficient and rational way of finding the set of columns with the highest coverage.
EDIT
For the interested, I decided to microbenchmark the 3 solutions provided:
#Defining G. Grothendieck's coverage function outside his solutions
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
#G. Grothendieck top solution
solution1 <- function(d){
cols <- tail(as.numeric(names(sort(colSums(d)))), 20)
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
}
#G. Grothendieck "Older solution"
solution2 <- function(d){
require(lpSolve)
ones <- rep(1, 300)
res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
m <- matrix(res$solution[1:3000] == 1, 300)
cols <- which(rowSums(m) > 0)
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
}
#user2554330 solution
bestCols <- function(d, n = 5) {
result <- numeric(n)
for (i in seq_len(n)) {
result[i] <- which.max(colMeans(d))
d <- d[d[,result[i]] != 1,, drop = FALSE]
}
result
}
#Benchmarking...
microbenchmark::microbenchmark(solution1 = solution1(d),
solution2 = solution2(d),
solution3 = bestCols(d), times = 10)
>
Unit: microseconds
expr min lq mean median uq max neval
solution1 390811.850 497155.887 549314.385 578686.3475 607291.286 651093.16 10
solution2 55252.890 71492.781 84613.301 84811.7210 93916.544 117451.35 10
solution3 425.922 517.843 3087.758 589.3145 641.551 25742.11 10
This looks like a relatively hard optimization problem, because of the ways columns interact. An approximate strategy would be to pick the column with the highest mean; then delete the rows with ones in that column, and repeat. You won't necessarily find the best solution this way, but you should get a fairly good one.
For example,
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
bestCols <- function(d, n = 5) {
result <- numeric(n)
for (i in seq_len(n)) {
result[i] <- which.max(colMeans(d))
d <- d[d[,result[i]] != 1,, drop = FALSE]
}
cat("final dim is ", dim(d))
result
}
col_set <- bestCols(d)
sum(apply(d[,colnames(d) %in% col_set], 1, sum) > 0) / 30 #top 5
This gives 90% coverage.
The following provides a heuristic to find an approximate solution. Find the N=20 columns, say, with the most ones, cols, and then use brute force to find every subset of 5 columns out of those 20. The subset having the highest coverage is shown below and its coverage is 93.3%.
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
N <- 20
cols <- tail(as.numeric(names(sort(colSums(d)))), N)
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
Repeating this for N = 5, 10, 15 and 20 we get coverages of 83.3%, 86.7%, 90% and 93.3%. The higher the N, the better the coverage; the lower the N, the shorter the run time.
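A sketch of that experiment, reusing d and coverage() from above:
sapply(c(5, 10, 15, 20), function(N) {
  cols <- tail(as.numeric(names(sort(colSums(d)))), N)
  co <- combn(cols, 5)
  max(apply(co, 2, coverage))  # best coverage achievable within the top-N columns
})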
Older solution
We can approximate the problem with a knapsack problem that chooses the 5 columns with largest numbers of ones using integer linear programming.
We take the 10 best solutions to this approximate problem and collect every column that appears in at least one of them. There are 14 such columns, and we then use brute force to find which subset of 5 of those 14 columns has the highest coverage.
library(lpSolve)
ones <- rep(1, 300)
res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
# each column of m is logical 300-vector defining possible soln
m <- matrix(res$solution[1:3000] == 1, 300)
# cols is the set of columns which are in any of the 10 solutions
cols <- which(rowSums(m) > 0)
length(cols)
## [1] 14
# use brute force to find the 5 best columns among cols
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
You can test whether there is a better column and, if so, exchange it with one currently in the selection.
n <- 5 #Number of columns / events
i <- rep(1, n)
for(k in 1:10) { #How many times to iterate
tt <- i
for(j in seq_along(i)) {
x <- +(rowSums(d[,i[-j]]) > 0)
i[j] <- which.max(colSums(x == 0 & d == 1))
}
if(identical(tt, i)) break
}
sort(i)
#[1] 90 123 197 199 286
mean(rowSums(d[,i]) > 0)
#[1] 0.9333333
Taking into account that the initial condition influences the result, you can use random starts.
n <- 5 #Number of columns / events
# drop columns that are dominated by another column (their covered rows are a subset of another's)
x <- apply(d, 2, function(x) colSums(x == 0 & d == 1))
diag(x) <- -1
idx <- which(!apply(x == 0, 1, any))
# add back one representative of each group of identical columns
x <- apply(d, 2, function(x) colSums(x != d))
diag(x) <- -1
x[upper.tri(x)] <- -1
idx <- unname(c(idx, which(apply(x == 0, 1, any))))
res <- sample(idx, n)
for(l in 1:100) {
i <- sample(idx, n)
for(k in 1:10) { #How many times to iterate
tt <- i
for(j in seq_along(i)) {
x <- +(rowSums(d[,i[-j]]) > 0)
i[j] <- which.max(colSums(x == 0 & d == 1))
}
if(identical(tt, i)) break
}
if(sum(rowSums(d[,i]) > 0) > sum(rowSums(d[,res]) > 0)) res <- i
}
sort(res)
#[1] 90 123 197 199 286
mean(rowSums(d[,res]) > 0)
#[1] 0.9333333
This is a follow on from a previous question I asked, but adds an extra layer of complexity, hence a new question.
I have two groups (39 and 380 in the example below). What I need to do is assign 889 people into the 39 groups, each consisting of between 2 and 7 people, and the 380 groups, each consisting of between 2 and 6 people.
However, there is a constraint on the total number of people that can belong in certain sets of groups. In the example below, the maximum value allowed for each row is in column X6.
Using the example below: if row 2 had 6 in column X2 and 120 in column X4, the total number of people would be 18 (6*3) + 240 (120*2) = 258, which would be fine as it is under 324.
So what I am after, for each row, is a value of X1*X2 + X3*X4 (to make column X5) that is less than or equal to X6, with the sum of X2 being 39, the sum of X4 being 380, and the total sum of X5 being 889. Ideally any solution would be as random as possible (so that repeated runs give different solutions where possible) and would also work when the values differ from 889, 39 and 380.
Thanks!
DF <- data.frame(matrix(0, nrow = 7, ncol = 6))
DF[,1] <- c(2:7,"Sum")
DF[7,2] <- 39
DF[2:6,3] <- 2:6
DF[7,4] <- 380
DF[7,5] <- 889
DF[1:6,6] <- c(359, 324, 134, 31, 5, 2)
DF[1,3:4] <- NA
DF[7,3] <- NA
DF[7,6] <- NA
EDIT
The phrasing of my problem may not be the clearest. Here is an example of the code I am currently using, and how it fails to meet the criteria I set above.
homeType=rep(c("a", "b"), times=c(39, 380))
H <- vector(mode="list", length(homeType))
for(i in seq(H)){
H[[i]]$type <- homeType[i]
H[[i]]$n <- 0
}
# Place people in houses up to max number of people
npeople <- 889
for(i in seq(npeople)){
placed_in_house <- FALSE
while(!placed_in_house){
house_num <- sample(length(H), 1)
if(H[[house_num]]$type == "a"){
if(H[[house_num]]$n < 7){
H[[house_num]]$n <- H[[house_num]]$n + 1
placed_in_house <- TRUE
}
}
if(H[[house_num]]$type == "b"){
if(H[[house_num]]$n < 6){
H[[house_num]]$n <- H[[house_num]]$n + 1
placed_in_house <- TRUE
}
}
}
}
# move people around to get up to min number of people
for(i in seq(H)){
while(H[[i]]$n < 2){
knock_on_door <- sample(length(H), 1)
if( H[[knock_on_door]]$n > 2){
H[[i]]$n <- H[[i]]$n + 1 # house i takes 1 person
H[[knock_on_door]]$n <- H[[knock_on_door]]$n - 1 # house knock_on_door loses 1 person
}
}
}
Ha <- H[which(lapply(H, function(x){x$type}) == "a")]
Hb <- H[which(lapply(H, function(x){x$type}) == "b")]
Ha_T <- data.frame(t(table(data.frame(matrix(unlist(Ha), nrow=length(Ha), byrow=T)))))
Hb_T <- data.frame(t(table(data.frame(matrix(unlist(Hb), nrow=length(Hb), byrow=T)))))
DF_1 <- data.frame(matrix(0, nrow = 7, ncol = 6))
DF_1[,1] <- c(2:7,"Sum")
DF_1[7,2] <- 39
DF_1[2:6,3] <- 2:6
DF_1[7,4] <- 380
DF_1[7,5] <- 889
DF_1[1:6,6] <- c(359, 324, 134, 31, 5, 2)
for(i in 1:nrow(Ha_T)){DF_1[as.numeric(as.character(Ha_T[i,1]))-1,2] <- Ha_T[i,3]}
for(i in 1:nrow(Hb_T)){DF_1[as.numeric(as.character(Hb_T[i,1])),4] <- Hb_T[i,3]}
DF_1$X5[1:6] <- (as.numeric(as.character(DF_1$X1[1:6]))*DF_1$X2[1:6])+(as.numeric(as.character(DF_1$X3[1:6]))*DF_1$X4[1:6])
DF_1$X7 <- DF_1$X2+DF_1$X4
DF_1[1,3:4] <- NA
DF_1[7,3] <- NA
DF_1[7,6] <- NA
Using this example the problem is row 2 in DF_1. The value in Column X7 (X2+X4) is greater than the permitted number shown in Column X6. What I need is a solution where the values in X7 are less or equal to the values in X6, but the sum of columns X2, X4 and X5 (X1*X2+X3*X4) equal 39, 380 and 889 respectively (although these numbers change depending on the data used).
The original description of the problem in the question is impossible to satisfy: no set of values can meet all of these constraints at once.
"So what I am after for each row is a value of X1*X2 + X3*X4 (to make
column X5) that is less or equal to X6 with the sum of X2 being 39,
the sum of X4 being 380 and the total sum of X5 being 889. "
However, following a restatement of the problem in the comments, the revised description of the problem can be solved as follows.
Update: Solution based on clarification of the problem in comments
According to a clarification in the comments
"I am not actually filling the number of houses completely. I am just assigning the number of children into houses. This is why
'a' is 2 to 7 and 'b' is 2 to 6, as 'a' households will also include 1
adult and 'b' households 2. For a given area I know how many 2 to 8
person households there are (419), and how many 2,3,4,5,6,7 or 8
person households exist (359,324,134,31,5,2). I also know the total
number of households with either 1 (39) or 2 (380) adults, and how
many children there are (889 in my example)."
Based on this updated information we can do the following, looping until all 889 children are placed in houses: 1) calculate how many more houses of each type can be allocated according to the criteria, 2) randomly select one of the house types that can still be allocated without breaching a rule, and 3) repeat. Note that I use more descriptive column names here, to make the logic easier to follow:
library(data.table)
DT <- data.table(HS1 = 2:7, # type 1 house size
NH1 = 0, # number of type 1 houses with children
HS2 = 1:6, # type 2 house size
NH2 = 0, # number of type 2 houses with children
C = 0, # number of children in houses
MaxNH = c(359, 324, 134, 31, 5, 2)) # maximum number of type1+type 2 houses
NR = DT[,.N]
set.seed(1234)
repeat {
while (DT[, sum(C) < 889]) {
DT[, MaxH1 := (MaxNH - NH1 - NH2)]
DT[, MaxH2 := (MaxNH - NH1 - NH2)]
DT[1,MaxH2 := 0 ]
DT[MaxH1 > 39 - sum(NH1), MaxH1 := 39 - sum(NH1)]
DT[MaxH2 > 380 - sum(NH2), MaxH2 := 380 - sum(NH2)]
if (DT[, sum(NH1)] >= 39) DT[, MaxH1 := 0]
if (DT[, sum(NH2)] >= 380) DT[, MaxH2 := 0]
if (DT[, all(MaxH1==0) & all(MaxH2==0)]) { # check if it is not possible to assign anyone else to a group
print("No solution found. Check constraints or try again")
break
}
# If you wish to preferentially fill a particular type of house, then change the probability weights in the next line accordingly
newgroup = sample(2*NR, 1, prob = DT[, c(MaxH1, MaxH2)])
if (newgroup > NR) DT[rep(1:NR, 2)[newgroup], NH2 := NH2+1] else DT[rep(1:NR, 2)[newgroup], NH1 := NH1+1]
DT[, C := HS1*NH1 + HS2*NH2]
}
if (DT[, sum(C)==889]) break
}
DT[,1:6, with=F]
# HS1 NH1 HS2 NH2 C MaxNH
#1: 2 7 1 0 14 359
#2: 3 7 2 218 457 324
#3: 4 14 3 76 284 134
#4: 5 9 4 14 101 31
#5: 6 2 5 3 27 5
#6: 7 0 6 1 6 2
colSums(DT[, .(NH1, NH2, C)])
# NH1 NH2 C
# 39 312 889
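A quick sanity check (not part of the original answer) that no row exceeds its house-count cap:
DT[, all(NH1 + NH2 <= MaxNH)]
# TRUE for the run shown above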
This code checks whether the generated data meet the criteria; after each iteration it stops for the user to decide whether to keep trying. For me, the selection process never dropped below 348 b-houses with 2 people each, so the result always violated the second condition (the limit of 324 houses). Should the a and b house types be offset in the df?
df <- data.frame(a=2:7, afreq=0, b=c(0,2:6), bfreq=0, housed=0, houses=500, correct=c(359, 324, 134, 31, 5, 2))
H <- data.frame(type=homeType, n=0) # using df instead of lists, easier for me
npeople <- 889
while(any(df$houses > df$correct)){
H <- data.frame(type=homeType, n=0)
# This code is yours, changed to df
for(i in 1:npeople){
placed_in_house <- FALSE
while(!placed_in_house){
house_num <- sample(nrow(H), 1)
if(H$type[house_num] == "a"){
if(H$n[house_num] < 7){
H$n[house_num] <- H$n[house_num] + 1
placed_in_house <- TRUE
}
}
if(H$type[house_num] == "b"){
if(H$n[house_num] < 6){
H$n[house_num] <- H$n[house_num] + 1
placed_in_house <- TRUE
}
}
}
}
# Subsets of houses with lack of people and possible sources
# This is iterative to randomize the full dataset
Hempty <- which(H$n < 2)
Hfull <- which(H$n >= 2)
k <- 1 # effort counter
while(length(Hempty) > 0){
for(hempty in Hempty){
knock_on_door <- sample(Hfull, 1)
H$n[knock_on_door] <- H$n[knock_on_door] - 1 # moves from a full house
H$n[hempty] <- H$n[hempty] + 1 # moves into an empty house
}
Hempty <- which(H$n < 2)
Hfull <- which(H$n >= 2)
print(paste("Iteration:", k, ", remaining empty houses:", length(Hempty)))
k <- k + 1
}
# Frequencies how many houses house how many people
freqs <- data.frame(table(H))
df$afreq[match(freqs$n[freqs$type == "a"], df$a)] <- freqs$Freq[freqs$type == "a"]
df$bfreq[match(freqs$n[freqs$type == "b"], df$b)] <- freqs$Freq[freqs$type == "b"]
df$housed <- df[,1]*df[,2] + df[,3]*df[,4]
df$houses <- df$afreq + df$bfreq
# Check what is wrong with the occupancy and let user have a say
print(df)
if(any(df$houses > df$correct)){
readline("There are more houses with a number of occupants than permitter. Hit [enter]")
}
}
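As a final hypothetical check once the loop exits, the three totals should match the targets from the question:
c(sum(df$afreq), sum(df$bfreq), sum(df$housed))  # 39 380 889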