Efficient Montecarlo simulation over a grid in R - r

I am running a Montecarlo simulation of a multinomial logit. Therefore I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values. In particular, changing both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far, I have managed to solve it, but at some point, I incurred into a nested for-loop structure over a grid search of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual(n.choices_list). Finally, I am quite worried about the efficiency of the usage of my last bit of code with the double for-loop structure running on the combinations of the possible values. Probably there is a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I managed to arrive a multiples objects that contain the models from the combinations of the grid search with informative names, but also would be great if I could collapse all of them in a list but with the current structure, I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}

You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
result <- map2(v1, v2, function(v1, v2) assign(paste0("m_ind_", v1, "_choices_", v2), lapply(X = 1:nreps, FUN = mlogit_sim_data)))
result will be a list of your function outputs.

Related

Converting Lists into Data Frames

I am trying to make a data frame with 3 columns (and 100 rows) containing random numbers such that the random numbers in each row add to 72.
I am using this code to generate the random numbers:
random_numbers <- diff(c(0, sort(sample(72, 2)), 72))
Although, I can't "fit" these random numbers into a data frame because of the format. For example:
i <- 1:100
d <- data.frame(i)
d$rand <- diff(c(0, sort(sample(72, 2)), 72))
Error in `$<-.data.frame`(`*tmp*`, rand, value = c(37, 21, 14)) :
replacement has 3 rows, data has 100
I had another idea in which at least I can create all 100 random numbers:
results <- list()
for (i in 1:100) {
r_i <- diff(c(0, sort(sample(72, 2)), 72))
results[[i]] <- r_i
}
results[1]
# [[1]]
# [1] 3 19 50
results[2]
# [[1]]
# [1] 16 11 45
But I am not sure how I can I can create a data frame with 3 columns and 100 rows from this data.
I know how to do this in the "classical" sense:
i <- 1:100
r_1 <- rnorm(5, 5, 100)
r_2 <- rnorm(5, 5, 100)
r_3 <- rnorm(5, 5, 100)
d <- data.frame(i, r_1, r_2, r_3)
d = data.frame(i, r_1, r_2, r_3)
But of course, in the above data frame, these 3 numbers will most certainly not add to 72.
Is it possible to take the 100 random numbers results that I generated above and then place them into a data frame?
We may use replicate with n specified as the number of rows of 'd' and assign new columns from the matrix output
d[paste0("r_", 1:3)] <- t(replicate(nrow(d),
diff(c(0, sort(sample(72, 2)), 72))))
-testing for equality
> all(rowSums(d[-1]) == 72)
[1] TRUE

Test if a given numeric is comprised in a set of intervals defined in a dataframe

I have a dataframe which includes 2 columns, let's say "left" and "right", which define intervals. I want to test if a given numeric "x" is part of any interval defined by the dataframe (if it is, it should be only once, those intervals don't overlap). Expected behaviour:
> df <- data.frame(id = c("A", "B", "C"), left = c(0, 50, 150), right = c(15, 78, 190))
> df
id left right
1 A 0 15
2 B 50 78
3 C 150 190
> my_function(7)
TRUE
> my_function(20)
FALSE
So I did it this way, but it's terribly slow and I'm pretty sure this could be optimized:
my_function <- function(x) {
test <- df %>% dplyr::rowwise() %>% dplyr::mutate(test = (x >= left) && (x <= right)) %>% ungroup()
test <- test %>% filter(test == T)
nrow(test) == 1
}
Then I'd be interested in getting the matching row in case the output is TRUE, but with the current function it'll take forever (the actual dataframe has ~5,000 rows, and I want to test/get coordinates for thousands of x values).
I found a library that manages interval objets but it seems it's tailored for time intervals. Any suggestion?
Here is a simple way with an example:
z <- 567 # single dummy value
left <- x1 <- seq(100, 900, 200)
right <- seq(200, 1000, 200)
df <- data.frame(left, right) # dummy intervals
lo <- z >= df$left
hi <- z <= df$right
check <- lo * hi
introw <- which(check == 1)
introw
3
z2 <- c(356, 934, 134, 597, 771) # vector of values to check
lo2 <- sapply(z2, function(x) x >= df$left)
hi2 <- sapply(z2, function(x) x <= df$right)
check2 <- lo2 * hi2
introws <- apply(check2, 2, function(x) which(x ==1))
introws #vector of intervals for each input value
introws
2 5 1 3 4
final <- cbind(value = z2, interval = introws)
final
value interval
[1,] 356 2
[2,] 934 5
[3,] 134 1
[4,] 597 3
[5,] 771 4
Try this approach using between():
#Code
my_function <- function(x) {
test <- df %>% dplyr::rowwise() %>%
dplyr::mutate(test = between(x,left,right)) %>% ungroup()
test <- test %>% filter(test == T)
nrow(test) == 1
}

Find combination of n vectors across k dataframes with highest correlation

Let's assume four data frames, each with 3 vectors, e.g.
setA <- data.frame(
a1 = c(6,5,2,4,5,3,4,4,5,3),
a2 = c(4,3,1,4,5,1,1,6,3,2),
a3 = c(5,4,5,6,4,6,5,5,3,3)
)
setB <- data.frame(
b1 = c(5,3,4,3,3,6,4,4,3,5),
b2 = c(4,3,1,3,5,2,5,2,5,6),
b3 = c(6,5,4,3,2,6,4,3,4,6)
)
setC <- data.frame(
c1 = c(4,4,5,5,6,4,2,2,4,6),
c2 = c(3,3,4,4,2,1,2,3,5,4),
c3 = c(4,5,4,3,5,5,3,5,5,6)
)
setD <- data.frame(
d1 = c(5,5,4,4,3,5,3,5,5,4),
d2 = c(4,4,3,3,4,3,4,3,4,5),
d3 = c(6,5,5,3,3,4,2,5,5,4)
)
I'm trying to find n number of vectors in each data frame, that have the highest correlation among each other. For this simple example, let's say want to find the n = 1 vectors in each of the k = 4 data frames, that show the overall strongest, positive correlation cor().
I'm not interested in the correlation of vectors within a data frame, but the correlation between data frames, since i wish to pick 1 variable from each set.
Intuitively, I would sum all the correlation coefficients for each combination, i.e.:
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setC$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setC$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setC$d1)))
... # and so on...
...but this seems like brute-forcing a solution that might be solvable more elegantly, with some kind of clustering-technique?
Anyhow, I was hoping to find a dynamic solution like function(n = 1, ...) where (... for data frames) which would return a list of the highest correlating vector names.
Base on your example I would not go with a really complicated algorithm unless your actual data is huge. This is a simple approach I think gets what you want.
So base on your 4 data frames a creates the list_df and then in the function I just generate all the possible combinations of variables an calculate their correlation. At the end I select the n combinations with highest correlation.
list_df = list(setA,setB,setC,setD)
CombMaxCor = function(n = 1,list_df){
column_names = lapply(list_df,colnames)
mat_comb = expand.grid(column_names)
mat_total = do.call(cbind,list_df)
vec_cor = rep(NA,nrow(mat_comb))
for(i in 1:nrow(mat_comb)){
vec_cor[i] = sum(cor(mat_total[,as.character(unlist(mat_comb[i,]))]))
}
pos_max_temp = rev(sort(vec_cor))[1:n]
pos_max = vec_cor%in%pos_max_temp
comb_max_cor = mat_comb[pos_max,]
return(comb_max_cor)
}
You could use comb function:
fun = function(x){
nm = paste0(names(x),collapse="")
if(!grepl("(.)\\d.*\\1",nm,perl = T))
setNames(sum(cor(x)),nm)
}
unlist(combn(a,4,fun,simplify = FALSE))[1:3]#Only printed the first 3
a1b1c1d1 a1b1c1d2 a1b1c1d3
3.246442 4.097532 3.566949
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
[1] 3.246442
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d2)))
[1] 4.097532
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d3)))
[1] 3.566949
Here is a function we can use to get n non-repeating columns from each data frame to get the max total correlation:
func <- function(n, ...){
list.df <- list(...)
n.df <- length(list.df)
# 1) First get the correlations
get.two.df.cors <- function(df1, df2) apply(df1, 2,
function(x) apply(df2, 2, function(y) cor(x,y))
)
cor.combns <- lapply(list.df, function(x)
lapply(list.df, function(y) get.two.df.cors(x,y))
)
# 2) Define function to help with aggregating the correlations.
# We will call them for different combinations of selected columns from each df later
# cmbns: given a df corresponding columns to be selected each data frame
# (i-th row corresponds to i-th df),
# return the "total correlation"
get.cmbn.sum <- function(cmbns, cor.combns){
# a helper matrix to help aggregation
# each row represents which two data frames we want to get the correlation sums
df.df <- t(combn(seq(n.df), 2, c))
# convert to list of selections for each df
cmbns <- split(cmbns, seq(nrow(cmbns)))
sums <- apply(df.df, 1,
function(dfs) sum(
cor.combns[[dfs[1]]][[dfs[2]]][cmbns[[dfs[2]]], cmbns[[dfs[1]]]]
)
)
# sum of the sums give the "total correlation"
sum(sums)
}
# 3) Now perform the aggragation
# get the methods of choosing n columns from each of the k data frames
if (n==1) {
cmbns.each.df <- lapply(list.df, function(df) matrix(seq(ncol(df)), ncol=1))
} else {
cmbns.each.df <- lapply(list.df, function(df) t(combn(seq(ncol(df)), n, c)))
}
# get all unique selection methods
unique.selections <- Reduce(function(all.dfs, new.df){
all.dfs.lst <- rep(list(all.dfs), nrow(new.df))
all.new.rows <- lapply(seq(nrow(new.df)), function(x) new.df[x,,drop=F])
for(i in seq(nrow(new.df))){
for(j in seq(length(all.dfs.lst[[i]]))){
all.dfs.lst[[i]][[j]] <- rbind(all.dfs.lst[[i]][[j]], all.new.rows[[i]])
}
}
do.call(c, all.dfs.lst)
}, c(list(list(matrix(numeric(0), nrow=0, ncol=n))), cmbns.each.df))
# for each unique selection method, calculate the total correlation
result <- sapply(unique.selections, get.cmbn.sum, cor.combns=cor.combns)
return( unique.selections[[which.max(result)]] )
}
And now we have:
# n = 1
func(1, setA, setB, setC, setD)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 2
# n = 2
func(2, setA, setB, setC, setD)
# [,1] [,2]
# [1,] 1 2
# [2,] 2 3
# [3,] 2 3
# [4,] 2 3

Creating mock data with natural decreasing numbers

I want to create random mock data looks like this.
__ID__|__Amount__
1 20
1 14
1 9
1 3
2 11
2 5
2 2
Starting from the random number but the second number with the same ID should be lesser than the first one, and the third number has to be lesser than the second one. Maximum number to start should be 20.
you can just create the data first and then sort it as you need, using tidyverse :
set.seed(0)
df <- data.frame(id = rep(1:3,10), amt = sample(1:20, 30, replace = TRUE))
df %>%
group_by(id) %>%
arrange(id, desc(amt))
This is a tricky one if you want the Amount column to be truly random values you can use a recursive call that will use sample recursively:
## Recursively sampling from a uniform distribution
recursive.sample <- function(start, end, length, results = NA, counter =0) {
## To enter the recursion, counter must be smaller than the length out
## and the last result must be smaller than the starting point (except the firs time)
if(counter < length && ifelse(counter != 0, results[counter] > start, TRUE)){
## Increment the counter
counter <- counter + 1
## Sample between start and the last result or the start and the end of the vector
results[counter] <- ifelse(counter != 1, sample(start:results[counter-1], 1), sample(start:end, 1))
## Recursive call
return(recursive.sample(start = start, end = end, length = length, results = results, counter = counter))
} else {
## Exit the recursion
return(results)
}
}
## Example
set.seed(0)
recursive.sample(start = 1, end = 20, length = 3, results = NA, counter = 0)
#[1] 18 5 2
Alternatively (and way easier) you can use sort(sample()):
set.seed(0)
sort(sample(1:20, 3), decreasing = TRUE)
#[1] 18 7 6
Note that the results differ due to the lower probability of sampling higher values in the recursive function.
You can then easily create your table with your chosen function as follow:
set.seed(123)
## The ID column
ID <- c(rep(1, 4), rep(2,3))
## The Amount column
Amount <- c(recursive.sample(1, 20, 4, NA, 0), recursive.sample(1, 11, 3, NA, 0))
## The table
cbind(ID, Amount)
# ID Amount
#[1,] 1 18
#[2,] 1 5
#[3,] 1 2
#[4,] 1 2
#[5,] 2 10
#[6,] 2 3
#[7,] 2 3
Or, again, with the simple sort(sample()) function for a higher probability of picking larger numbers.
Two methods, one using dplyr and one using only base R functions. These are slightly different to the two previous solutions.
I used sorted ID column, but this is not necessary.
Method 1
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df %>% group_by(ID) %>%
mutate(Amount = sort(sample(1 : 20, n(), replace = T), decreasing = TRUE))
Method 2
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df$Amount <- NA
uniq_ID <- unique(df$ID)
index_lst <- lapply(uniq_ID, function(x) which(df$ID == x))
res <- lapply(index_lst, function(x) sort(sample(1 : 20, length(x)),
decreasing = TRUE))
df$Amount[unlist(index_lst)] <- unlist(res)
Method 2.5
This is more convoluted than the 2nd method.
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df$Amount <- NA
tab <- as.data.frame(table(df$ID))
lapply(1 : nrow(tab), function(x) df$Amount[which(df$ID == tab$Var1[x])] <<-
sort(sample(1 : 20, tab$Freq[x]), decreasing = TRUE))

How to pairwise compare values referring to distinct elements in two matrices of different formats?

I've got a set of objects, let's say with the IDs 'A' to 'J'. And I've got two data frames which look the following way (as you can see, the second data frame is symmetric):
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,69,9,83,26,NA,67,95,74,69,67,NA,6,84,9,95,6,NA), ncol = 5, nrow = 5, dimnames = list(df1$ID, df1$ID)))
For example, take the objects 'B' and 'E'. I want to know: Is 13+28 (from df1) less than 9 (from df2)? I'd like to know this for all pairs of objects. The output should be
(a) a logical data frame structured like df2 and
(b) the number of "TRUE" values.
Most of the time I will only need result (b), but sometimes I would also need (a). So if (b) can be calculated without (a) and if this would be significantly faster, then I'd like to have both algorithms in order to select the suitable one dependent on which output I need to answer a particular question.
I'm comparing around 2000 objects, so the algorithm should be reasonably fast. So far I've been only able to implement this with two nested for-loops which is awfully slow. I bet there is a much nicer way to do this, maybe exploiting vectorisation.
This is what it currently looks like:
df3 <- as.data.frame(matrix(data = NA, ncol = nrow(df1), nrow = nrow(df1),
dimnames = list(df1$ID, df1$ID)))
for (i in 2:nrow(df3)){
for (j in 1:(i-1)){
sum.val <- df1[df1$ID == rownames(df3)[i], "Var"] + df1[df1$ID == names(df3)[j], "Var"]
df3[i,j] <- sum.val <= df2[i,j]
}
}
#
Is this what you want?
df3 <- outer(df1$Var, df1$Var, "+")
df3
df4 <- df3 < df2
df4
sum(df4, na.rm = TRUE)
Here's one way to do it...
# Get row and column indices
ind <- t( combn( df1$ID , 2 ) )
# Get totals
tot <- with( df1 , Var[ match( ind[,1] , ID ) ] + Var[ match( ind[,2] , ID ) ] )
# Make df2 a matrix
m <- as.matrix( df2 )
# Total number of values is simply
sum( m[ ind ] > tot )
#[1] 7
# Find which values in upper triangle part of the matrix exceed those from df1 (1 = TRUE)
m[upper.tri(m)] <- m[ ind ] > tot
# A B C D E
#A NA 1 1 1 0
#B 42 NA 1 0 1
#C 83 26 NA 1 1
#D 74 69 67 NA 0
#E 84 9 95 6 NA
This will do what you want.
# Generate the data
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,
69,9,83,26,NA,67,95,74,69,
67,NA,6,84,9,95,6,NA),
ncol = 5, nrow = 5,
dimnames = list(df1$ID, df1$ID)))
# Define a pairwise comparison index matrix using 'combn'
idx <- combn(nrow(df1), 2)
# Create a results matrix
res <- matrix(NA, ncol = ncol(df2), nrow = nrow(df2))
# Loop through 'idx' for each possible comparison (without repeats)
for(i in 1:ncol(idx)){
logiTest <- (df1$Var[idx[1,i]] + df1$Var[idx[2,i]]) < df2[idx[1,i], idx[2,i]]
res[idx[1,i], idx[2, i]] <- logiTest
res[idx[2,i], idx[1, i]] <- logiTest
}
# Count the number of 'true' comparisons
nTrues <- sum(res, na.rm = TRUE)/2
The code simply uses a pairwise comparison index (idx) to define which elements in both df1 and df2 are to be used in each iteration of the 'for loop'. It then uses this same index to define where in the 'res' matrix the answer to the logical test is to be written.
N.B. This code will break down if the order of elements in df1 and df2 are not the same. In such cases, it would be appropriate to use the actual letters to define which values to compare.

Resources