Complex function that iterates mvrnorm over rows of a data frame

I have data that looks like this:
data = data.frame(a.coef = c(.14, .15, .16),
                  b.coef = c(.4, .5, .6),
                  a.var = c(0.0937, 0.0934, 0.0945),
                  b.var = c(0.00453, 0.00564, 0.00624),
                  ab.cov = c(0.000747, 0.000747, 0.000747))
and I would like to run the following code (source: http://www.quantpsy.org/medmc/medmc.htm) on each row of the data set.
require(MASS)
a = data$a.coef
b = data$b.coef
rep = 10000
conf = 95
pest = c(a, b)
acov <- matrix(c(data$a.var, data$ab.cov,
                 data$ab.cov, data$b.var), 2, 2)
mcmc <- mvrnorm(rep, pest, acov, empirical = FALSE)
ab <- mcmc[ , 1] * mcmc[ , 2]
low = (1 - conf / 100) / 2
upp = ((1 - conf / 100) / 2) + (conf / 100)
LL = quantile(ab, low)
UL = quantile(ab, upp)
LL4 = format(LL, digits = 4)
UL4 = format(UL, digits = 4)
I've created a relatively simple function that takes the data and the row number as inputs:
MCMAM <- function(data_input, row_number) {
  data = data_input[row_number, ]
  a = data[["a.coef"]]
  b = data[["b.coef"]]
  rep = 10000
  conf = 95
  pest = c(a, b)
  acov <- matrix(c(data[["a.var"]], data[["ab.cov"]],
                   data[["ab.cov"]], data[["b.var"]]), 2, 2)
  require(MASS)
  mcmc <- mvrnorm(rep, pest, acov, empirical = FALSE)
  ab <- mcmc[, 1] * mcmc[, 2]
  low = (1 - conf / 100) / 2
  upp = ((1 - conf / 100) / 2) + (conf / 100)
  LL = quantile(ab, low)
  UL = quantile(ab, upp)
  return(c(LL, UL))
}
MCMAM(data, 1)
2.5% 97.5%
-0.1901272 0.3104614
But it would be great if there was a way to get rid of the row specification and just have the function run through the data set row by row and save the output to a new column in the data set.
I've been experimenting with for loops and apply functions but haven't had any success, largely because both the matrix() and mvrnorm() functions take values rather than vectors.

We can use lapply
do.call(rbind, lapply(seq_len(nrow(data)), MCMAM, data_input = data))
-output
2.5% 97.5%
[1,] -0.1832449 0.3098362
[2,] -0.2260856 0.3856575
[3,] -0.2521126 0.4666583
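If you want the intervals saved back into the data set, as the question asks, a minimal sketch (my addition, reusing the lapply call above) binds them on as new columns:
ci <- do.call(rbind, lapply(seq_len(nrow(data)), MCMAM, data_input = data))
data[c("LL", "UL")] <- as.data.frame(ci)  # adds the 2.5% and 97.5% bounds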
Or use rowwise
library(dplyr)
library(tidyr)
data %>%
  rowwise %>%
  mutate(new = list(MCMAM(cur_data(), 1))) %>%
  unnest_wider(new)
# A tibble: 3 x 7
# a.coef b.coef a.var b.var ab.cov `2.5%` `97.5%`
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0.14 0.4 0.0937 0.00453 0.000747 -0.185 0.309
#2 0.15 0.5 0.0934 0.00564 0.000747 -0.219 0.396
#3 0.16 0.6 0.0945 0.00624 0.000747 -0.259 0.472
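One usage note, my addition rather than part of the answer: cur_data() is deprecated as of dplyr 1.1.0, where pick() plays the same role:
data %>%
  rowwise() %>%
  mutate(new = list(MCMAM(pick(everything()), 1))) %>%
  unnest_wider(new)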

Related

How to optimize my function by dropping loops

I have the following function that uses nested loops, and honestly I'm not sure how to make it run more efficiently. It runs fine for 100 sims in my opinion, but when I ran it for 2000 sims it took almost 12 seconds.
This code will generate any n Brownian Motion simulations and works well; the issue is that once the simulation size is increased to, say, 500+, it starts to bog down, and by 2000 sims it's quite slow (around 12 seconds).
Here is the function:
ts_brownian_motion <- function(.time = 100, .num_sims = 10, .delta_time = 1,
                               .initial_value = 0) {
  # TidyEval ----
  T <- as.numeric(.time)
  N <- as.numeric(.num_sims)
  delta_t <- as.numeric(.delta_time)
  initial_value <- as.numeric(.initial_value)
  # Checks ----
  if (!is.numeric(T) | !is.numeric(N) | !is.numeric(delta_t) | !is.numeric(initial_value)) {
    rlang::abort(
      message = "All parameters must be numeric values.",
      use_cli_format = TRUE
    )
  }
  # Initialize empty data.frame to store the simulations
  sim_data <- data.frame()
  # Generate N simulations
  for (i in 1:N) {
    # Initialize the current simulation with a starting value of 0
    sim <- c(initial_value)
    # Generate the brownian motion values for each time step
    for (t in 1:(T / delta_t)) {
      sim <- c(sim, sim[t] + rnorm(1, mean = 0, sd = sqrt(delta_t)))
    }
    # Bind the time steps, simulation values, and simulation number together
    # in a data.frame and add it to the result
    sim_data <- rbind(
      sim_data,
      data.frame(
        t = seq(0, T, delta_t),
        y = sim,
        sim_number = i
      )
    )
  }
  # Clean up
  sim_data <- sim_data %>%
    dplyr::as_tibble() %>%
    dplyr::mutate(sim_number = forcats::as_factor(sim_number)) %>%
    dplyr::select(sim_number, t, y)
  # Return ----
  attr(sim_data, ".time") <- .time
  attr(sim_data, ".num_sims") <- .num_sims
  attr(sim_data, ".delta_time") <- .delta_time
  attr(sim_data, ".initial_value") <- .initial_value
  return(sim_data)
}
Here is some output of the function:
> ts_brownian_motion(.time = 10, .num_sims = 25)
# A tibble: 275 × 3
sim_number t y
<fct> <dbl> <dbl>
1 1 0 0
2 1 1 -2.13
3 1 2 -1.08
4 1 3 0.0728
5 1 4 0.562
6 1 5 0.255
7 1 6 -1.28
8 1 7 -1.76
9 1 8 -0.770
10 1 9 -0.536
# … with 265 more rows
# ℹ Use `print(n = ...)` to see more rows
As suggested in the comments, if you want speed, you should use cumsum. You need to be clear what type of Brownian Motion you want (arithmetic, geometric). For geometric Brownian motion, you'll need to correct the approximation error by adjusting the mean. As an example, the NMOF package (which I maintain) contains a function gbm that implements geometric Brownian Motion through cumsum. Here is an example call for 2000 paths with 100 timesteps each.
library("NMOF")
library("zoo") ## for plotting
timesteps <- 100
system.time(b <- NMOF::gbm(2000, tau = 1, timesteps = 100, r = 0, v = 1))
## user system elapsed
## 0.013 0.000 0.013
dim(b) ## each column is one path, starting at time zero
## [1] 101 2000
plot(zoo(b[, 1:5], 0:timesteps), plot.type = "single")
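For plain arithmetic Brownian motion, the same cumsum idea is only a few lines of base R. This sketch (mine, not from the question or the NMOF package; the name ts_bm_fast is made up) draws all increments in one call and takes column-wise cumulative sums:
ts_bm_fast <- function(.time = 100, .num_sims = 10, .delta_time = 1,
                       .initial_value = 0) {
  steps <- .time / .delta_time
  # all increments in one draw; one column per simulated path
  incr <- matrix(rnorm(steps * .num_sims, mean = 0, sd = sqrt(.delta_time)),
                 nrow = steps, ncol = .num_sims)
  # prepend the starting value, then accumulate along each column
  rbind(.initial_value, .initial_value + apply(incr, 2, cumsum))
}
b <- ts_bm_fast(.time = 10, .num_sims = 25)  # 11 x 25: each column is one path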

fast looping for user written functions

I have written two functions
a) the first create simulated data and estimates a model
b) the second iterates this process a number of times, and average statistics from multiple simulations.
The third step I would like to do is to iterate this process across different sample sizes. I know how to do this with a for loop, but it takes very long. Does anyone have suggestions on how to improve looping speed?
In particular, I would be interested in using parallel processing or evaluating alternative looping packages like purrr.
Here is an example:
# the first function simulates data and estimates the model
library(dplyr)
library(estimatr)
library(tibble)
genmodel <- function(n, meanx, meany) {
  df <- as.data.frame(list(mean_x = rnorm(n = n, mean = meanx, sd = 1)))
  df <- df %>% mutate(mean_y = rnorm(n = n, mean = meany, sd = 1))
  model <- lm_robust(mean_y ~ mean_x, data = df,
                     se_type = "stata")
  pval <- as.data.frame(list(p = summary(model)$coefficients)) %>% t()
  pval <- as.data.frame(pval) %>% rownames_to_column()
  return(pval)
}
# example
> genmodel(n=100,meanx=2,meany=1)
rowname (Intercept) mean_x
1 p.Estimate 9.984653e-01 -0.05115484
2 p.Std..Error 2.027905e-01 0.10273142
3 p.t.value 4.923630e+00 -0.49794738
4 p.Pr...t.. 3.441203e-06 0.61963671
5 p.CI.Lower 5.960341e-01 -0.25502201
6 p.CI.Upper 1.400896e+00 0.15271232
7 p.DF 9.800000e+01 98.00000000
The second function iterates the first function a number of times and averages the estimated statistics:
library(data.table)
library(purrr)
average_model <- function(nrep = 100, # number of simulations
                          n,
                          mean_x,
                          mean_y) {
  tmpres <- lapply(1:nrep, function(x) genmodel(n = n, meanx = mean_x, meany = mean_y))
  tmpres <- do.call(rbind, tmpres)
  vec <- names(tmpres[2:ncol(tmpres)])
  tmpres <- unique(setDT(tmpres)[, paste("avg", (vec), sep = "_") := map(.SD, ~ mean(.x)),
                                 by = rowname, .SDcols = (vec)
                                 ][, nobs := n] %>%
                     select(rowname, `avg_(Intercept)`, avg_mean_x, nobs))
}
# example
tst<-average_model(nrep=50,n=100,mean_x=2,mean_y=1)
rowname avg_(Intercept) avg_mean_x nobs
1: p.Estimate 1.06002378 -0.03100749 100
2: p.Std..Error 0.22368299 0.09921118 100
3: p.t.value 4.83878275 -0.31190506 100
4: p.Pr...t.. 0.00206157 0.45198433 100
5: p.CI.Lower 0.61613217 -0.22788884 100
6: p.CI.Upper 1.50391540 0.16587386 100
7: p.DF 98.00000000 98.00000000 100
Now my objective is to iterate this average_model function over different sample sizes and to create a unique data frame with all of the information. This can be easily done using a for loop
results <- data.frame()
for (i in seq(from = 100, to = 500, by = 30)) {
  tmpres <- average_model(nrep = 50, n = i, mean_x = 2, mean_y = 1)
  results <- rbind(results, tmpres) # sequentially append results
}
head(results)
rowname avg_(Intercept) avg_mean_x nobs
1: p.Estimate 1.001296821 0.000989775 100
2: p.Std..Error 0.224800002 0.099078646 100
3: p.t.value 4.530076894 0.027428073 100
4: p.Pr...t.. 0.001934362 0.504152193 100
5: p.CI.Lower 0.555188534 -0.195628574 100
6: p.CI.Upper 1.447405108 0.197608124 100
# it can also be done using `lapply`, but both approaches are quite slow
tmpres <- lapply(seq(from = 100, to = 500, by = 30),
                 function(x) average_model(nrep = 50, n = x, mean_x = 2, mean_y = 1))
tmpres <- do.call(rbind, tmpres)
The problem with this for loop is that it is extremely slow.
Is there a way I could do this using parallel processing? Other suggestions for reducing running time?
This "all data.table" approach is about twice as fast, but still disappointing.
The basic idea is to assemble all the datasets into one large data.table and then cycle through the models using data.table group by.
library(data.table)
library(estimatr)
library(tictoc)
##
tic()
mf <- data.table(nrep=1:50, meanx=2, meany=1)
mf <- mf[, .(n=seq(100, 500, 30)), by=.(nrep, meanx, meany)]
data <- mf[, .(mean_x=rnorm(n, meanx), mean_y=rnorm(n, meany)), by=.(n, nrep, meanx, meany)]
result <- data[, as.data.table(t(summary(lm_robust(mean_y ~ mean_x, .SD, se_type = "stata"))$coefficients), keep.rownames = TRUE)
               , by = .(n, nrep, meanx, meany)][, nrep := NULL]
result <- result[, lapply(.SD, mean), by=.(n, meanx, meany, rn)]
toc()
## 2.58 sec elapsed
So this takes between 2.3 and 2.6 sec on my machine, whereas your code runs in about 4.0 - 4.1 sec. About 80% of the time is spent running lm_robust(...). If I swap that out for lm(...) in base R it runs in about 1 sec.
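As for the parallel-processing part of the question: the sample-size loop is embarrassingly parallel, so a sketch along these lines (my addition, assuming average_model as defined above) hands each sample size to a worker. Note that mclapply forks and therefore runs serially on Windows, where parallel::parLapply with a cluster is the usual substitute.
library(parallel)
res <- mclapply(seq(100, 500, 30),
                function(x) average_model(nrep = 50, n = x, mean_x = 2, mean_y = 1),
                mc.cores = max(1, detectCores() - 1))
results <- do.call(rbind, res)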
This can be done in a more straightforward way:
library(tidyverse)
expand_grid(
  nreps = 50,
  n = seq.default(100, 500, by = 30),
  mean_x = 2, mean_y = 1
) %>%
  rowid_to_column("n_idx") %>%
  uncount(nreps, .remove = FALSE) %>%
  rowid_to_column("nreps_idx") %>%
  rowwise() %>%
  mutate(
    lm_robust =
      estimatr::lm_robust(
        y ~ X,
        data =
          tibble(y = rnorm(n, mean = mean_y, sd = 1),
                 X = rnorm(n, mean = mean_x, sd = 1)),
        se_type = "stata"
      ) %>%
      coefficients() %>%
      set_names(str_c("coef_", names(.))) %>%
      list()
  ) %>%
  unnest_wider(lm_robust) %>%
  group_by(nreps_idx) %>%
  summarise(
    n = unique(n),
    across(starts_with("coef"), mean),
  )
Which results in:
# A tibble: 700 × 4
nreps_idx n `coef_(Intercept)` coef_X
<int> <dbl> <dbl> <dbl>
1 1 100 1.34 -0.183
2 2 100 0.845 0.0188
3 3 100 0.949 0.0341
4 4 100 1.20 -0.0705
5 5 100 0.731 0.0419
6 6 100 0.809 0.0564
7 7 100 0.920 0.0558
8 8 100 1.22 -0.0673
9 9 100 1.22 -0.171
10 10 100 1.26 -0.127
# … with 690 more rows
Which is computed fairly quickly.
Now, I've not included all the parameters in your code, because honestly it doesn't make sense to take the mean of them, but if you want them as well then...
expand_grid(
  nreps = 50,
  n = seq.default(100, 500, by = 30),
  mean_x = 2, mean_y = 1
) %>%
  rowid_to_column("n_idx") %>%
  uncount(nreps, .remove = FALSE) %>%
  rowid_to_column("nreps_idx") %>%
  rowwise() %>%
  mutate(
    lm_robust =
      estimatr::lm_robust(
        y ~ X,
        data =
          tibble(y = rnorm(n, mean = mean_y, sd = 1),
                 X = rnorm(n, mean = mean_x, sd = 1)),
        se_type = "stata"
      ) %>%
      # SECOND APPROACH
      summary() %>%
      `[[`("coefficients") %>%
      as_tibble(rownames = "rowname") %>%
      pivot_wider(names_from = "rowname",
                  values_from = everything()) %>%
      # FIRST APPROACH
      # coefficients() %>%
      # set_names(str_c("coef_", names(.))) %>%
      list()
  ) %>%
  unnest_wider(lm_robust) %>%
  print() %>%
  group_by(nreps_idx) %>%
  summarise(
    n = unique(n),
    across(starts_with("Estimate"), mean),
    # insert statements here to summarise the other gathered stuff
  )
But this makes things unnecessarily complicated.

How can I get the coefficients from nlsList into a dataframe?

Is there a way to extract just the estimates from nlsList()?
Sample data:
library(nlme)
dat <- read.table(text = "time gluc starch solka
1 6.32 7.51 1.95
2 20.11 25.49 6.43
3 36.03 47.53 10.39
6 107.52 166.31 27.01
12 259.28 305.19 113.72
24 283.40 342.56 251.14
48 297.55 353.66 314.22", header = TRUE)
long <- tidyr::pivot_longer(dat, -1, values_to = "y")
long$name <- factor(long$name)
st0 <- list(Max = 200, k = 0.1, Lag = 0.5)
kinetics <- nlsList(y ~ (time > Lag) * Max * (1 - exp(-k * (time - Lag))) | name, long, start = st0)
I would like to end up with a data frame of the samples and their estimates for Max, k, and Lag, but cannot figure out how.
We could extract the coef and then loop over the 3d array with apply
library(nlme)
m1 <- apply(summary(kinetics)$coef, 3, function(x) x[,1])
dat <- transform(as.data.frame(m1), name = row.names(m1))[c(4, 1:3)]
row.names(dat) <- NULL
-output
dat
name Max k Lag
1 gluc 299.6637 0.16155846 2.426204
2 solka 337.5416 0.06583197 4.966971
3 starch 353.7206 0.18416048 2.276593
Here is a simple way: just coerce the appropriate dimension of the coefficient array to class "data.frame".
cf_smry <- coef(summary(kinetics))[, 1, ]
as.data.frame(cf_smry)
# Max k Lag
#gluc 299.6637 0.16155846 2.426204
#solka 337.5416 0.06583197 4.966971
#starch 353.7206 0.18416048 2.276593
coef(kinetics) gives a data frame, so any of these would work; they differ only in whether the names appear as row names (the first) or as a column (the others).
coef(kinetics)
data.frame(name = names(kinetics), coef(kinetics), row.names = NULL)
tibble::rownames_to_column(coef(kinetics), var = "name")

Looking for an apply, tidyr or dplyr solution to a nested for loop situation in R

Weirdly for this one, I think it's easier to start by viewing the df.
#reproducible data
quantiles<-c("50","90")
var=c("w","d")
df=data.frame(a=runif(20,0.01,.5),b=runif(20,0.02,.5),c=runif(20,0.03,.5),e=runif(20,0.04,.5),
q50=runif(20,1,5),q90=runif(20,10,50))
head(df)
I want to automate a function that I've created (below) to calculate vars using different combinations of values from my df.
For example, the calculation of w needs to use a and b, and d needs to use c and e such that w = a *q ^ b and d = c * q ^ e. Further, q is a quantile, so I actually want w50, w90, etc., which will correspond to q50, q90 etc. from the df.
The tricky part as I see it is setting the condition to use a & b vs. c & e without using nested loops.
I have a function to calculate the vars using the appropriate columns; however, I can't get all the pieces together efficiently.
# function to calculate the w, d
calc_wd <- function(df, col_name, col1, col2, col3) {
  # calculate and create new column col_name for each combo of var and
  # quantile, e.g. "w_50", "d_50", etc.
  df[[col_name]] <- df[[col1]] * (df[[col2]] ^ (df[[col3]]))
  df
}
I can get this to work for a single case, but not by automating the coefficient swap... you'll see I specify "a" and "b" below.
wd <- c("w_", "d_")
make_wd_list <- apply(expand.grid(wd, quantiles), 1, paste, collapse = "")
calc_wd(df, make_wd_list[1], "a", paste0("q", sapply(strsplit(make_wd_list[1], "_"), tail, 1)), "b")
Alternatively, I have tried to make this work using nested for loops, but can't seem to append the data correctly. And it's ugly.
var <- c("w", "d")
dataf <- data.frame()
for (j in unique(var)) {
  if (j == "w") {
    coeff1 <- "a"
    coeff2 <- "b"
  } else if (j == "d") {
    coeff1 <- "c"
    coeff2 <- "e"
  }
  print(coeff1)
  print(coeff2)
  for (k in unique(quantiles)) {
    dataf <- calc_wd(df, paste0(j, k), coeff1, paste0("q", k), coeff2)
    dataf[k, j] <- rbind(df, dataf) # this ain't right; tried do.call outside, etc.
  }
}
In the end, I'm looking to have new columns with w_50, w_90, etc., which use q50, q90 and the corresponding coefficients as defined originally.
One approach I find easy to type is using purrr::pmap. I like this because when you use with(list(...),), you can access the column names of your data.frame by name. Additionally, you can supply additional arguments.
library(purrr)
pmap_df(df, quant = "q90", ~ with(list(...), {
  list(w = a * get(quant) ^ b, d = c * get(quant) ^ e)
}))
## A tibble: 20 x 2
# w d
# <dbl> <dbl>
# 1 0.239 0.295
# 2 0.152 0.392
# 3 0.476 0.828
# 4 0.344 0.236
# 5 0.439 1.00
You could combine this with, for example, a second map call to iterate over quantiles.
library(dplyr)
map(setNames(quantiles, quantiles),
    ~ pmap_df(df, quant = paste0("q", .x),
              ~ with(list(...), { list(w = a * get(quant) ^ b, d = c * get(quant) ^ e) }))
) %>% do.call(cbind, .)
# 50.w 50.d 90.w 90.d
#1 0.63585897 0.11045837 1.7276019 0.1784987
#2 0.17286184 0.22033649 0.2333682 0.5200265
#3 0.32437528 0.72502654 0.5722203 1.4490065
#4 0.68020897 0.33797621 0.8749206 0.6179557
#5 0.73516886 0.38481785 1.2782923 0.4870877
Then assigning a custom function is trivial.
calcwd <- function(df, quantiles) {
  map(setNames(quantiles, quantiles),
      ~ pmap_df(df, quant = paste0("q", .x),
                ~ with(list(...), { list(w = a * get(quant) ^ b, d = c * get(quant) ^ e) }))
  ) %>% do.call(cbind, .)
}
I love @Ian's answer for the completeness and the use of classics like with and do.call. I'm late to the scene with my solution, but since I have been trying to get better with rowwise operations (including the use of rowwise()), I thought I would offer up a less elegant but simpler and faster solution using just mutate, formula.tools, and map_dfc.
library(dplyr)
library(purrr)
require(formula.tools)
# same type example data plus a much larger version in df2 for
# performance testing
df <- data.frame(a = runif(20, 0.01, .5),
b = runif(20, 0.02, .5),
c = runif(20, 0.03, .5),
e = runif(20, 0.04, .5),
q50 = runif(20,1,5),
q90 = runif(20,10,50)
)
df2 <- data.frame(a = runif(20000, 0.01, .5),
b = runif(20000, 0.02, .5),
c = runif(20000, 0.03, .5),
e = runif(20000, 0.04, .5),
q50 = runif(20000,1,5),
q90 = runif(20000,10,50)
)
# from your original post
quantiles <- c("q50", "q90")
wd <- c("w_", "d_")
make_wd_list <- apply(expand.grid(wd, quantiles),
1,
paste, collapse = "")
make_wd_list
#> [1] "w_q50" "d_q50" "w_q90" "d_q90"
# an empty list to hold our formulas
eqn_list <- vector(mode = "list",
length = length(make_wd_list)
)
# populate the list makes it very extensible to more outcomes
# or to more quantile levels
for (i in seq_along(make_wd_list)) {
  if (substr(make_wd_list[[i]], 1, 1) == "w") {
    eqn_list[[i]] <- as.formula(paste(make_wd_list[[i]], "~ a * ", substr(make_wd_list[[i]], 3, 5), " ^ b"))
  } else if (substr(make_wd_list[[i]], 1, 1) == "d") {
    eqn_list[[i]] <- as.formula(paste(make_wd_list[[i]], "~ c * ", substr(make_wd_list[[i]], 3, 5), " ^ e"))
  }
}
# formula.tools helps us grab both left and right sides
add_column <- function(df, equation) {
  df <- transmute_(df, rhs(equation))
  colnames(df)[ncol(df)] <- as.character(lhs(equation))
  return(df)
}
result <- map_dfc(eqn_list, ~ add_column(df = df, equation = .x))
#> w_q50 d_q50 w_q90 d_q90
#> 1 0.10580863 0.29136904 0.37839737 0.9014040
#> 2 0.34798729 0.35185585 0.64196417 0.4257495
#> 3 0.79714122 0.37242915 1.57594506 0.6198531
#> 4 0.56446922 0.43432160 1.07458217 1.1082825
#> 5 0.26896574 0.07374273 0.28557366 0.1678035
#> 6 0.36840408 0.72458466 0.72741030 1.2480547
#> 7 0.64484009 0.69464045 1.93290705 2.1663690
#> 8 0.43336109 0.21265672 0.46187366 0.4365486
#> 9 0.61340404 0.47528697 0.89286358 0.5383290
#> 10 0.36983212 0.53292900 0.53996112 0.8488402
#> 11 0.11278412 0.12532491 0.12486156 0.2413191
#> 12 0.03599639 0.25578020 0.04084221 0.3284659
#> 13 0.26308183 0.05322304 0.87057854 0.1817630
#> 14 0.06533586 0.22458880 0.09085436 0.3391683
#> 15 0.11625845 0.32995233 0.12749040 0.4730407
#> 16 0.81584442 0.07733376 2.15108243 0.1041342
#> 17 0.38198254 0.60263861 0.68082354 0.8502999
#> 18 0.51756058 0.43398089 1.06683204 1.3397900
#> 19 0.34490492 0.13790601 0.69168711 0.1580659
#> 20 0.39771037 0.33286225 1.32578056 0.4141457
microbenchmark::microbenchmark(result <- map_dfc(eqn_list, ~ add_column(df = df2, equation = .x)), times = 10)
#> Unit: milliseconds
#>                                                                expr      min       lq     mean  median       uq      max neval
#>  result <- map_dfc(eqn_list, ~add_column(df = df2, equation = .x)) 10.58004 11.34603 12.56774 11.6257 13.24273 16.91417    10
The mutate and formula solution is about fifty times faster, although both rip through 20,000 rows in less than a second.
Created on 2020-04-30 by the reprex package (v0.3.0)

view values used by function boot to bootstrap estimates

I have written the code below to obtain a bootstrap estimate of a mean. My objective is to view the numbers selected from the data set, ideally in the order they are selected, by the function boot in the boot package.
The data set only contains three numbers: 1, 10, and 100 and I am only using two bootstrap samples.
The estimated mean is 23.5 and the R code below indicates that the six numbers included one '1', four '10' and one '100'. However, there are 30 possible combinations of those numbers that would have resulted in a mean of 23.5.
Is there a way for me to determine which of those 30 possible combinations is the combination that actually appeared in the two bootstrap samples?
library(boot)
set.seed(1234)
dat <- c(1, 10, 100)
av <- function(dat, i) { sum(dat[i])/length(dat[i]) }
av.boot <- boot(dat, av, R = 2)
av.boot
#
# ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
# Call:
# boot(data = dat, statistic = av, R = 2)
#
#
# Bootstrap Statistics :
# original bias std. error
# t1* 37 -13.5 19.09188
#
mean(dat) + -13.5
# [1] 23.5
# The two samples must have contained one '1', four '10' and one '100',
# but there are 30 possibilities.
# Which of these 30 possible sequences actual occurred?
# This code shows there must have been one '1', four '10' and one '100'
# and shows the 30 possible combinations
my.combos <- expand.grid(V1 = c(1, 10, 100),
V2 = c(1, 10, 100),
V3 = c(1, 10, 100),
V4 = c(1, 10, 100),
V5 = c(1, 10, 100),
V6 = c(1, 10, 100))
my.means <- apply(my.combos, 1, function(x) {( (x[1] + x[2] + x[3])/3 + (x[4] + x[5] + x[6])/3 ) / 2 })
possible.samples <- my.combos[my.means == 23.5,]
dim(possible.samples)
n.1 <- rowSums(possible.samples == 1)
n.10 <- rowSums(possible.samples == 10)
n.100 <- rowSums(possible.samples == 100)
n.1[1]
n.10[1]
n.100[1]
length(unique(n.1)) == 1
length(unique(n.10)) == 1
length(unique(n.100)) == 1
I think you can determine the numbers sampled and the order in which they are sampled with the code below. You have to extract the function ordinary.array from the boot package and paste that function into your R code. Then specify the values for n, R and strata, where n is the number of observations in the data set and R is the number of replicate samples you want.
I do not know how general this approach is, but it worked with a couple of simple examples I tried, including the example below.
library(boot)
set.seed(1234)
dat <- c(1, 10, 100, 1000)
av <- function(dat, i) { sum(dat[i])/length(dat[i]) }
av.boot <- boot(dat, av, R = 3)
av.boot
#
# ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
# Call:
# boot(data = dat, statistic = av, R = 3)
#
#
# Bootstrap Statistics :
# original bias std. error
# t1* 277.75 -127.5 132.2405
#
#
mean(dat) + -127.5
# [1] 150.25
# boot:::ordinary.array
ordinary.array <- function(n, R, strata) {
  inds <- as.integer(names(table(strata)))
  if (length(inds) == 1L) {
    output <- sample.int(n, n * R, replace = TRUE)
    dim(output) <- c(R, n)
  } else {
    output <- matrix(as.integer(0L), R, n)
    for (is in inds) {
      gp <- seq_len(n)[strata == is]
      # bsample() is internal to boot; if your data are stratified, call
      # boot:::bsample() when pasting this function into your own code
      output[, gp] <- if (length(gp) == 1) rep(gp, R) else bsample(gp, R * length(gp))
    }
  }
  output
}
# I think the function ordinary.array determines which elements
# of the data are sampled in each of the R samples
set.seed(1234)
ordinary.array(n=4,R=3,1)
# [,1] [,2] [,3] [,4]
# [1,] 1 3 1 3
# [2,] 3 4 1 3
# [3,] 3 3 3 3
#
# which equals:
((1+100+1+100) / 4 + (100+1000+1+100) / 4 + (100+100+100+100) / 4) / 3
# [1] 150.25
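A simpler route, my addition rather than part of the answer above: boot.array() in the boot package regenerates the index array for a boot object from its stored seed, so the sampled values can be read off directly. A sketch:
idx <- boot.array(av.boot, indices = TRUE)  # one row of indices per replicate
array(dat[idx], dim(idx))                   # the resampled values, in order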
