In July of 2021, the survey package overhauled the svyquantile() function. Quoting from the package's NEWS:
svyquantile() has been COMPLETELY REWRITTEN. The old version is available as oldsvyquantile().
svyquantile() was being used in one of my packages when this change occurred. A co-author updated the code to retain the old version of the function for continuity. Now that it has been 1.5 years since the release, we are updating the code to use the new version. However, we are experiencing an issue with the new version when combining it with svyby().
When we use the updated function, the results for all levels of the by variable come out identical. Also, the returned data frame duplicates each of the columns (once for each level of the by variable).
My guess is that rather than showing the results for the by variable in each row, the results are mistakenly being added as new columns (with repeated column names, and duplicated rows).
Am I doing something wrong with svyby(), or is this a bug in the code? The survey package is so widely used and I am no expert in its use, so I lean toward there being an issue with my code. But I am having trouble diagnosing the problem.
Thank you!
suppressPackageStartupMessages(library(survey))
packageVersion("survey")
#> [1] '4.1.1'
data(api)
# OLD svyquantile, results as expected
svyby(
  # svyby args
  formula = ~api00,
  by = ~both,
  design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
  FUN = oldsvyquantile,
  # args being passed to oldsvyquantile
  na.rm = TRUE,
  keep.var = FALSE,
  quantiles = 0.5
)
#> both statistic
#> No No 631.0
#> Yes Yes 653.5
# NEW svyquantile, I don't understand what I am doing wrong
svyby(
  # svyby args
  formula = ~api00,
  by = ~both,
  design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
  FUN = svyquantile,
  # args being passed to svyquantile
  na.rm = TRUE,
  keep.var = FALSE,
  quantiles = 0.5
)
#> both statistic.statistic.api00.quantile statistic.statistic.api00.ci.2.5
#> No No 631 547
#> Yes Yes 631 547
#> statistic.statistic.api00.ci.97.5 statistic.statistic.api00.se
#> No 722 39.75493
#> Yes 722 39.75493
#> statistic.statistic.api00.quantile statistic.statistic.api00.ci.2.5
#> No 655 566
#> Yes 655 566
#> statistic.statistic.api00.ci.97.5 statistic.statistic.api00.se
#> No 717 34.94774
#> Yes 717 34.94774
# results are as expected when used outside `svyby()`
svyquantile(
  x = ~api00,
  design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
  na.rm = TRUE,
  keep.var = FALSE,
  quantiles = 0.5
)
#> $api00
#> quantile ci.2.5 ci.97.5 se
#> 0.5 652 561 714 35.66788
#>
#> attr(,"hasci")
#> [1] TRUE
#> attr(,"class")
#> [1] "newsvyquantile"
Created on 2022-12-25 with reprex v2.0.2
It's a bug. It's fixed in the development version on R-Forge (https://r-forge.r-project.org/R/?group_id=1788).
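Until that fix reaches CRAN, two possible workarounds (a sketch based on the reprex above, not an official recommendation): install the development version from R-Forge, or compute the per-group quantiles by subsetting the design directly.
# install.packages("survey", repos = "http://R-Forge.R-project.org")

# Or: median of api00 within each level of `both`, bypassing svyby()
des <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
sapply(unique(as.character(apiclus1$both)), function(lev) {
  q <- svyquantile(~api00, subset(des, both == lev),
                   quantiles = 0.5, na.rm = TRUE)
  q$api00[1, "quantile"]
})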
I have a function with five variables that I want to maximize, using only a specific set of parameter values for each variable.
Are there any methods in R that can do this, other than brute force? (e.g. Particle Swarm Optimization, Genetic Algorithms, Greedy search, etc.) I have looked at a few packages, but they seem to create their own set of parameter values within a given range. I am only interested in optimizing over the set of options provided.
Here is a simplified version of the problem:
#Example of 5 variable function to optimize
Fn <- function(x){
  a = x[1]
  b = x[2]
  c = x[3]
  d = x[4]
  e = x[5]
  SUM = a + b + c + d + e
  return(SUM)
}
#Parameters for variables to optimize
Vars = list(
  As = c(seq(1.5, 3, by = 0.3)),        #Float
  Bs = c(1, 2),                         #Binary
  Cs = c(seq(1, 60, by = 10)),          #Integer
  Ds = c(seq(60, -60, length.out = 5)), #Negative
  Es = c(1, 2, 3)
)
#Full combination
FullCombn= expand.grid(Vars)
Results = data.frame(I = as.numeric(), Sum = as.numeric())
for (i in 1:nrow(FullCombn)){
  ParsI = FullCombn[i, ]
  ResultI = Fn(ParsI)
  Results = rbind(Results, c(I = i, Sum = ResultI))
}
#Best iteration (Largest result)
Best=Results[Results[, 2] == max(Results[, 2]),]
#Best parameters
FullCombn[Best$I,]
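For reference, an equivalent and more compact way to run the same full enumeration in base R (a sketch, avoiding the row-by-row rbind):
# Evaluate Fn on every row of the grid at once
Sums <- apply(FullCombn, 1, Fn)
# Best parameters (largest result)
FullCombn[which.max(Sums), ]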
Two more possibilities. Both minimize by default, so I flip the sign in your objective function (i.e. return -SUM).
#Example of 5 variable function to optimize
Fn <- function(x, ...){
  a = x[1]
  b = x[2]
  c = x[3]
  d = x[4]
  e = x[5]
  SUM = a + b + c + d + e
  return(-SUM)
}
#Parameters for variables to optimize
Vars = list(
  As = c(seq(1.5, 3, by = 0.3)),        #Float
  Bs = c(1, 2),                         #Binary
  Cs = c(seq(1, 60, by = 10)),          #Integer
  Ds = c(seq(60, -60, length.out = 5)), #Negative
  Es = c(1, 2, 3)
)
First, a grid search. Exactly what you did, just convenient. And the implementation allows you to distribute the evaluations of the objective function.
library("NMOF")
gridSearch(fun = Fn,
levels = Vars)[c("minfun", "minlevels")]
## 5 variables with 6, 2, 6, 5, ... levels: 1080 function evaluations required.
## $minfun
## [1] -119
##
## $minlevels
## [1] 3 2 51 60 3
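To distribute the evaluations, as mentioned above, gridSearch accepts a method argument; a sketch (see ?gridSearch for the supported options):
## distribute the 1080 evaluations over several cores
gridSearch(fun = Fn, levels = Vars,
           method = "multicore",  ## uses parallel::mclapply (not on Windows);
                                  ## "snow" works with a cluster instead
           mc.control = list(mc.cores = 2))[c("minfun", "minlevels")]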
An alternative: a simple Local Search. You start with a valid initial guess, and then move randomly through possible feasible solutions. The key ingredient is the neighbourhood function. It picks one element randomly and then, again randomly, sets this element to one allowed value.
nb <- function(x, levels, ...) {
  i <- sample(length(levels), 1)
  x[i] <- sample(levels[[i]], 1)
  x
}
(There would be better algorithms for neighbourhood functions; but this one is simple and so demonstrates the idea well.)
LSopt(Fn, list(x0 = c(1.8, 2, 11, 30, 2), ## a feasible initial solution
               neighbour = nb,
               nI = 200                   ## iterations
               ),
      levels = Vars)$xbest
## Local Search.
## ##...
## Best solution overall: -119
## [1] 3 2 51 60 3
(Disclosure: I am the maintainer of package NMOF, which provides functions gridSearch and LSopt.)
In response to the comment, a few remarks on Local Search and the neighbourhood function above (nb). Local Search, as implemented in LSopt, starts with an arbitrary solution and then changes that solution slightly. This new solution, called a neighbour, is compared (by its objective-function value) with the old solution. If the new solution is better, it becomes the current solution; otherwise it is rejected and the old solution remains the current one. Then the algorithm repeats, for a number of iterations. So, in short, Local Search is not random sampling, but a guided random walk through the search space. It is guided because only better solutions get accepted; worse ones are rejected. In this sense, LSopt will narrow in on good parameter values.
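In code, the core of that accept/reject loop looks roughly like this (a stripped-down sketch, not the actual LSopt implementation, which does more, e.g. reporting and argument handling):
LS_sketch <- function(OF, x0, neighbour, nI, ...) {
  xc <- x0                    ## current solution
  fc <- OF(xc, ...)           ## its objective-function value
  for (k in seq_len(nI)) {
    xn <- neighbour(xc, ...)  ## propose a neighbour
    fn <- OF(xn, ...)
    if (fn <= fc) {           ## accept it if it is not worse ...
      xc <- xn
      fc <- fn
    }                         ## ... otherwise keep the old solution
  }
  list(xbest = xc, OFvalue = fc)
}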
The implementation of the neighbourhood function is not ideal, for two reasons. The first is that a solution may not be changed at all, since I sample from all feasible values: for a small set of possible values, as here, it may often happen that the current value is drawn again. For larger search spaces, however, this inefficiency is typically negligible, since the probability of sampling the same value becomes smaller. Often so small, in fact, that the additional code for testing whether the solution has changed becomes more expensive than the occasionally-wasted iteration.
A second thing could be improved, albeit through a more complicated function. And again, for this small problem it does not matter. In the current neighbourhood, an element is picked and then set to any feasible value. But that means that changes from one solution to the next may be large. Instead of picking any feasible value of the As, in realistic problems it will often be better to pick a value close to the current value: for example, when you are at 2.1, move to 1.8 or 2.4, but not to 3.0. (This reasoning is only relevant, of course, if the variable in question is on a numeric or at least ordinal scale.)
Ultimately, what implementation works well can be tested only empirically. Many more details are in this tutorial.
Here is one alternative implementation. A solution is now a vector of positions into the original values: e.g. if x[1] is 2, it "points" to 1.8; if x[2] is 1, it points to 1; and so on.
## precompute lengths of vectors in Vars
lens <- lengths(Vars)
nb2 <- function(x, lens, ...) {
  i <- sample(length(lens), 1)
  if (x[i] == 1L) {
    x[i] <- 2
  } else if (x[i] == lens[i]) {
    x[i] <- lens[i] - 1
  } else
    x[i] <- x[i] + sample(c(1, -1), 1)
  x
}
## the objective function now needs to map the
## indices in x back to the levels in Vars
Fn2 <- function(x, levels, ...){
  y <- mapply(`[`, levels, x)
  ## => same as
  ## y <- numeric(length(x))
  ## y[1] <- Vars[[1]][x[1]]
  ## y[2] <- Vars[[2]][x[2]]
  ## ....
  SUM <- sum(y)
  return(-SUM)
}
xbest <- LSopt(Fn2,
               list(x0 = c(1, 1, 1, 1, 1), ## an initial solution
                    neighbour = nb2,
                    nI = 200               ## iterations
                    ),
               levels = Vars,
               lens = lens)$xbest
## Local Search.
## ....
## Best solution overall: -119
## map the solution back to the values
mapply(`[`, Vars, xbest)
## As Bs Cs Ds Es
## 3 2 51 60 3
Here is a genetic algorithm solution with package GA.
The key is to write a decode function that enforces the constraints; see the package vignette.
library(GA)
#> Loading required package: foreach
#> Loading required package: iterators
#> Package 'GA' version 3.2.2
#> Type 'citation("GA")' for citing this R package in publications.
#>
#> Attaching package: 'GA'
#> The following object is masked from 'package:utils':
#>
#> de
decode <- function(x) {
  As <- Vars$As
  Bs <- Vars$Bs
  Cs <- Vars$Cs
  Ds <- rev(Vars$Ds)  # findInterval() needs increasing values
  # fix real variable As
  i <- findInterval(x[1L], As)
  if (x[1L] - As[i] < As[i + 1L] - x[1L])
    x[1L] <- As[i]
  else x[1L] <- As[i + 1L]
  # fix binary variable Bs
  if (x[2L] - Bs[1L] < Bs[2L] - x[2L])
    x[2L] <- Bs[1L]
  else x[2L] <- Bs[2L]
  # fix integer variable Cs
  i <- findInterval(x[3L], Cs)
  if (x[3L] - Cs[i] < Cs[i + 1L] - x[3L])
    x[3L] <- Cs[i]
  else x[3L] <- Cs[i + 1L]
  # fix integer variable Ds
  i <- findInterval(x[4L], Ds)
  if (x[4L] - Ds[i] < Ds[i + 1L] - x[4L])
    x[4L] <- Ds[i]
  else x[4L] <- Ds[i + 1L]
  # fix the other integer variable, Es
  x[5L] <- round(x[5L])
  setNames(x, c("As", "Bs", "Cs", "Ds", "Es"))
}
Fn <- function(x){
  x <- decode(x)
  # a <- x[1]
  # b <- x[2]
  # c <- x[3]
  # d <- x[4]
  # e <- x[5]
  # SUM <- a + b + c + d + e
  SUM <- sum(x, na.rm = TRUE)
  return(SUM)
}
#Parameters for variables to optimize
Vars <- list(
  As = seq(1.5, 3, by = 0.3),        # Float
  Bs = c(1, 2),                      # Binary
  Cs = seq(1, 60, by = 10),          # Integer
  Ds = seq(60, -60, length.out = 5), # Negative
  Es = c(1, 2, 3)
)
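To see what decode() does, here is a quick check on a made-up candidate vector (purely illustrative): each component is snapped to the nearest allowed level.
decode(c(2.65, 1.4, 46.1, 48.9, 2.4))
#>   As   Bs   Cs   Ds   Es 
#>  2.7  1.0 51.0 60.0  2.0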
res <- ga(type = "real-valued",
          fitness = Fn,
          lower = c(1.5, 1, 1, -60, 1),
          upper = c(3, 2, 51, 60, 3),
          popSize = 1000,
          seed = 123)
summary(res)
#> ── Genetic Algorithm ───────────────────
#>
#> GA settings:
#> Type = real-valued
#> Population size = 1000
#> Number of generations = 100
#> Elitism = 50
#> Crossover probability = 0.8
#> Mutation probability = 0.1
#> Search domain =
#> x1 x2 x3 x4 x5
#> lower 1.5 1 1 -60 1
#> upper 3.0 2 51 60 3
#>
#> GA results:
#> Iterations = 100
#> Fitness function value = 119
#> Solutions =
#> x1 x2 x3 x4 x5
#> [1,] 2.854089 1.556080 46.11389 49.31045 2.532682
#> [2,] 2.869408 1.638266 46.12966 48.71106 2.559620
#> [3,] 2.865254 1.665405 46.21684 49.04667 2.528606
#> [4,] 2.866494 1.630416 46.12736 48.78017 2.530454
#> [5,] 2.860940 1.650015 46.31773 48.92642 2.521276
#> [6,] 2.851644 1.660358 46.09504 48.81425 2.525504
#> [7,] 2.855078 1.611837 46.13855 48.62022 2.575492
#> [8,] 2.857066 1.588893 46.15918 48.60505 2.588992
#> [9,] 2.862644 1.637806 46.20663 48.92781 2.579260
#> [10,] 2.861573 1.630762 46.23494 48.90927 2.555612
#> ...
#> [59,] 2.853788 1.640810 46.35649 48.87381 2.536682
#> [60,] 2.859090 1.658127 46.15508 48.85404 2.590679
apply(res@solution, 1, decode) |> t() |> unique()
#> As Bs Cs Ds Es
#> [1,] 3 2 51 60 3
Created on 2022-10-24 with reprex v2.0.2
I am rather new to R and could really use the community's help with the following problem. I am trying to solve for the variable r in the equation ((EPS2 + r*DPS1 - EPS1)/r^2) - PRC = 0. Here is my (unsuccessful) attempt at solving the problem (using the uniroot function):
EPS2 = df_final$EPS2
DPS1 = df_final$DPS1
EPS1 = df_final$EPS1
PRC = df_final$PRC
f1 = function(r) {
  ((df_final_test$EPS2 + r * df_final_test$DPS1 - df_final_test$EPS1)/r^2) - df_final_test$PRC
}
uniroot(f1,interval = c(1e-8,100000),EPS2, DPS1, EPS1, PRC , extendInt="downX")$root
I then get the following error: Error in f(lower, ...) : unused
arguments (c(" 1.39", " 1.39", ...
I am grateful for any tips and hints you could give me regarding this problem, or whether a different function/package would be better in this case.
I've added a reprex in case it helps anybody help me with this issue:
df <- structure(list(EPS1 = c(6.53, 1.32, 1.39, 1.71, 2.13),
                     DPS1 = c(2.53, 0.63, 0.81, 1.08, 1.33, 19.8),
                     EPS2 = c(7.57, 1.39, 1.43, 1.85, 2.49),
                     PRC = c(19.01, 38.27, 44.82, 35.27, 47.12)),
                .Names = c("EPS1", "DPS1", "EPS2", "PRC"),
                row.names = c(NA, -5L), class = "data.frame")
# (note: DPS1 has six values while the other columns have five, which
#  triggers the "corrupt data frame" warnings shown further below)
I don't think you can use uniroot if the coefficients are vectors rather than scalars. In this case, a straightforward approach is to solve the equation analytically: multiplying both sides by r^2 and rearranging gives the quadratic PRC*r^2 - DPS1*r + (EPS1 - EPS2) = 0, which the quadratic formula solves as
r1 <- (DPS1 + sqrt(DPS1^2-4*PRC*(EPS1-EPS2)))/(2*PRC)
and
r2 <- (DPS1 - sqrt(DPS1^2-4*PRC*(EPS1-EPS2)))/(2*PRC)
where r1 and r2 are the two roots.
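A quick numerical sanity check (added for illustration) against the first row of the reprex data, where EPS1 = 6.53, DPS1 = 2.53, EPS2 = 7.57 and PRC = 19.01:
r1 <- (2.53 + sqrt(2.53^2 - 4 * 19.01 * (6.53 - 7.57))) / (2 * 19.01)
r1                                          # about 0.30972
((7.57 + r1 * 2.53 - 6.53) / r1^2) - 19.01  # essentially zero
This agrees with the uniroot-based answer below, up to that solver's default tolerance.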
Disclaimer: I have no experience with uniroot() and no idea whether the following makes sense, but it runs! The idea is basically to call uniroot for each row of the data frame.
Note that I modified the function f1 slightly so that each of the additional parameters is passed as an argument to the function, rather than relying on finding objects with the same name in the parent environment. I also use with() to avoid typing df$... for every variable.
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.1.0
library(furrr)
#> Loading required package: future
df <- structure(list(EPS1 = c(6.53, 1.32, 1.39, 1.71, 2.13),
DPS1 = c(2.53, 0.63, 0.81, 1.08, 1.33, 19.8),
EPS2 = c(7.57,1.39,1.43,1.85,2.49),
PRC = c(19.01,38.27,44.82,35.27,47.12)),
.Names = c("EPS1", "DPS1", "EPS2", "PRC"),
row.names = c(NA,-5L), class = "data.frame")
df
#> Warning in format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
#> corrupt data frame: columns will be truncated or padded with NAs
#> EPS1 DPS1 EPS2 PRC
#> 1 6.53 2.53 7.57 19.01
#> 2 1.32 0.63 1.39 38.27
#> 3 1.39 0.81 1.43 44.82
#> 4 1.71 1.08 1.85 35.27
#> 5 2.13 1.33 2.49 47.12
f1 = function(r, EPS2, DPS1, EPS1, PRC) {
  ((EPS2 + r * DPS1 - EPS1)/r^2) - PRC
}
# try for first row
with(df,
     uniroot(f1,
             EPS2 = EPS2[1], DPS1 = DPS1[1], EPS1 = EPS1[1], PRC = PRC[1],
             interval = c(1e-8, 100000),
             extendInt = "downX")$root)
#> [1] 0.3097291
# it runs!
# loop over each row
vec_sols <- rep(NA, nrow(df))
for (i in seq_len(nrow(df))) {
  sol <- with(df, uniroot(f1,
                          EPS2 = EPS2[i], DPS1 = DPS1[i], EPS1 = EPS1[i], PRC = PRC[i],
                          interval = c(1e-8, 100000),
                          extendInt = "downX")$root)
  vec_sols[i] <- sol
}
vec_sols
#> [1] 0.30972906 0.05177443 0.04022946 0.08015686 0.10265226
# Alternatively, you can use furrr's future_map_dbl to use multiple cores.
# The following will basically do the same as the above loop,
# here with 4 cores.
plan(multisession, workers = 4)
vec_sols <- 1:nrow(df) %>% furrr::future_map_dbl(
  .f = ~with(df,
             uniroot(f1,
                     EPS2 = EPS2[.x], DPS1 = DPS1[.x], EPS1 = EPS1[.x], PRC = PRC[.x],
                     interval = c(1e-8, 100000),
                     extendInt = "downX")$root
  ))
vec_sols
#> [1] 0.30972906 0.05177443 0.04022946 0.08015686 0.10265226
# Then add the solutions back to the data frame (one root per row)
df %>% mutate(
  root = vec_sols
)
#> Warning in format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
#> corrupt data frame: columns will be truncated or padded with NAs
#> EPS1 DPS1 EPS2 PRC root
#> 1 6.53 2.53 7.57 19.01 0.30972906
#> 2 1.32 0.63 1.39 38.27 0.05177443
#> 3 1.39 0.81 1.43 44.82 0.04022946
#> 4 1.71 1.08 1.85 35.27 0.08015686
#> 5 2.13 1.33 2.49 47.12 0.10265226
Created on 2021-06-20 by the reprex package (v2.0.0)
I'm new to optimisation/calibration of models in R, but I'm eager to learn and really need some help. My question relates to demographic modelling.
I've done some research and found help here and here, but neither has quite answered my question.
I have a matrix of scalars (propensities) where each column must sum to 1. These propensities are used to estimate the number of households that would arise from a given population (persons by age). The propensity model tends to overestimate the number of households in history (for which I know the true number of households).
I want to calibrate the model to minimise the error in the number of households by tweaking the propensities, such that the columns still add to 1 and propensities with an initial value of zero remain zero.
Simple example:
# Propensities matrix
mtx <- matrix(c(0.00, 0.00, 0.85, 0.00, 0.15, 0.35, 0.45, 0.00,
                0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Population by age cohort
pop <- c(2600, 16200, 13400)
# True number of households
target <- c(7000, 4500, 5500)
# Function to optimise
hh <- function(mtx, pop, target) {
  # Estimate living arrangements
  x <- mtx %*% pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  x <- c(x[1,1]/2, x[2,1]/2, x[4,1]) - target
  return(x)
}
I haven't included any of my code for the optimisation/calibration step as it would be embarrassing and I haven't been able to get anything to work!
Ideally I will have one set of propensities that generalises well for lots of different regions at the end of this process. Any advice on how I should go about achieving this? Helpful links?
Update
The snippet of code below executes the local search method as suggested by Enrico.
library(tidyverse)
library(NMOF)
data <- list(mtx = matrix(c(0.00, 0.00, 0.90, 0.00, 0.10, 0.25, 0.50, 0.00,
                            0.25, 0.00, 0.60, 0.20, 0.00, 0.20, 0.00), ncol = 3),
             pop = c(2600, 16200, 13400),
             target = c(7190, 4650, 5920))
# True mtx
mtx.true <- matrix(c(0.00, 0.00, 0.75, 0.00, 0.25, 0.35, 0.45, 0.00,
                     0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Function to optimise
households <- function(x, data) {
  # Estimate living arrangements
  z <- x %*% data$pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  z <- c(z[1,1]/2, z[2,1]/2, z[4,1]) - data$target
  sum(abs(z))
}
# Local search function to perturb propensities
neighbour <- function(x, data) {
  # Choose random column from mtx
  i <- sample(1:ncol(x), 1)
  # Select two non-zero propensities from mtx column
  j <- which(x[, i] != 0) %>% sample(2, replace = FALSE)
  # Randomly select one to perturb positively
  x[j[1], i] <- 0.1 * (1 - x[j[1], i]) + x[j[1], i]
  # Perturb second propensity to ensure mtx column adds to 1
  x[j[2], i] <- x[j[2], i] + (1 - sum(x[, i]))
  x
}
# Local search algorithm inputs
localsearch <- list(x0 = data$mtx,
                    neighbour = neighbour,
                    nS = 50000,
                    printBar = FALSE)
# Execute
now <- Sys.time()
solution <- LSopt(OF = households, algo = localsearch, data)
#>
#> Local Search.
#> Initial solution: 2695
#> Finished.
#> Best solution overall: 425.25
Sys.time() - now
#> Time difference of 6.33272 secs
# Inspect propensity matrices
print(solution$xbest)
#> [,1] [,2] [,3]
#> [1,] 0.0000000 0.3925 0.6
#> [2,] 0.0000000 0.4250 0.2
#> [3,] 0.2937976 0.0000 0.0
#> [4,] 0.0000000 0.1825 0.2
#> [5,] 0.7062024 0.0000 0.0
print(mtx.true)
#> [,1] [,2] [,3]
#> [1,] 0.00 0.35 0.65
#> [2,] 0.00 0.45 0.15
#> [3,] 0.75 0.00 0.00
#> [4,] 0.00 0.20 0.20
#> [5,] 0.25 0.00 0.00
Thanks!
I can only comment on the optimisation part.
The code you have provided is sufficient, except that your objective function evaluates to a vector. You will need to transform this vector into a single number to be minimised, such as the sum of squares or of absolute values.
When it comes to methods, I would try heuristics; in fact, I would try a Local-Search method. These methods operate on the solution through functions which you define; thus, you may code your solution as a matrix. More specifically, you would need two functions: the objective function (which you essentially have) and a neighbourhood function, which takes a solution as input and modifies it. In your particular case, it could take a matrix, select two non-zero elements from one column, and increase one and decrease the other. Thus, the column sum would remain unchanged.
Perhaps the tutorial http://enricoschumann.net/files/NMOF_Rmetrics2012.pdf is of interest, with R code http://enricoschumann.net/files/NMOF_Rmetrics2012.R .
I am trying to simulate a discrete variable depicting the "true state of the world" (say, "red", "green" or "blue") and an indicator that describes it somewhat imperfectly.
r_names <- c("R", "G", "B")
Let's say I have some prior belief about the distribution of the "reality" variable, which I will use to sample it.
r_probs <- c(0.3, 0.5, 0.2)
set.seed(100)
reality <- sample(seq_along(r_names), 10000, prob=r_probs, replace = TRUE)
Now, let's say I have a conditional probability table that stipulates the distribution of the indicator given each of the "realities":
ri_matrix <- matrix(c(0.7, 0.3, 0,
0.2, 0.6, 0.2,
0.05,0.15,0.8), byrow=TRUE,nrow = 3)
dimnames(ri_matrix) <- list(paste("real", r_names, sep="_"),
paste("ind", r_names, sep="_"))
ri_matrix
># ind_R ind_G ind_B
># real_R 0.70 0.30 0.0
># real_G 0.20 0.60 0.2
># real_B 0.05 0.15 0.8
Since base::sample() is not vectorized over the prob argument, I have to write a wrapper:
sample_cond <- function(r, rim){
  unlist(lapply(r, function(x)
    sample(seq_len(ncol(rim)), 1, prob = rim[x, ], replace = TRUE)))
}
Now I can sample my "indicator" variable using the conditional probability matrix
set.seed(200)
indicator <- sample_cond(reality, ri_matrix)
Just to make sure the distributions turned out as expected:
prop.table(table(reality, indicator), margin = 1)
#> indicator
#> reality 1 2 3
#> 1 0.70043610 0.29956390 0.00000000
#> 2 0.19976124 0.59331476 0.20692400
#> 3 0.04365278 0.14400401 0.81234320
Is there a better (i.e. more idiomatic and/or efficient) way to sample a discrete variable conditioned on another discrete random variable?
UPDATE:
As suggested by @Mr.Flick, this is at least 50x faster, because it reuses probability vectors instead of repeatedly subsetting the conditional probability matrix.
sample_cond_group <- function(r, rim){
  il <- mapply(function(x, y) sample(seq(ncol(rim)), length(x), prob = y, replace = TRUE),
               x = split(r, r),
               y = split(rim, seq(nrow(rim))))
  unsplit(il, r)
}
You can be a bit more efficient by drawing all the random samples per group with a split/combine type strategy. That might look something like this
simFun <- function(N, r_probs, ri_matrix) {
  stopifnot(length(r_probs) == nrow(ri_matrix))
  ind <- sample.int(length(r_probs), N, prob = r_probs, replace = TRUE)
  grp <- split(data.frame(ind), ind)
  unsplit(Map(function(data, r) {
    draw <- sample.int(ncol(ri_matrix), nrow(data), replace = TRUE, prob = ri_matrix[r, ])
    data.frame(data, draw)
  }, grp, as.numeric(names(grp))), ind)
}
Then you can call it with
simFun(10000, r_probs, ri_matrix)
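For a rough comparison of the two approaches, something like the following could be timed (an illustrative sketch; results vary by machine, and note that simFun() also redraws the "reality" vector, so it does slightly more work than the per-row sampler):
system.time(sample_cond(reality, ri_matrix))   # row-by-row sampling
system.time(simFun(10000, r_probs, ri_matrix)) # split/combine strategy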