Related
I am running a Montecarlo simulation of a multinomial logit. Therefore I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values. In particular, changing both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far, I have managed to solve it, but at some point, I incurred into a nested for-loop structure over a grid search of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual(n.choices_list). Finally, I am quite worried about the efficiency of the usage of my last bit of code with the double for-loop structure running on the combinations of the possible values. Probably there is a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I managed to arrive a multiples objects that contain the models from the combinations of the grid search with informative names, but also would be great if I could collapse all of them in a list but with the current structure, I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}
You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
result <- map2(v1, v2, function(v1, v2) assign(paste0("m_ind_", v1, "_choices_", v2), lapply(X = 1:nreps, FUN = mlogit_sim_data)))
result will be a list of your function outputs.
Let's assume four data frames, each with 3 vectors, e.g.
setA <- data.frame(
a1 = c(6,5,2,4,5,3,4,4,5,3),
a2 = c(4,3,1,4,5,1,1,6,3,2),
a3 = c(5,4,5,6,4,6,5,5,3,3)
)
setB <- data.frame(
b1 = c(5,3,4,3,3,6,4,4,3,5),
b2 = c(4,3,1,3,5,2,5,2,5,6),
b3 = c(6,5,4,3,2,6,4,3,4,6)
)
setC <- data.frame(
c1 = c(4,4,5,5,6,4,2,2,4,6),
c2 = c(3,3,4,4,2,1,2,3,5,4),
c3 = c(4,5,4,3,5,5,3,5,5,6)
)
setD <- data.frame(
d1 = c(5,5,4,4,3,5,3,5,5,4),
d2 = c(4,4,3,3,4,3,4,3,4,5),
d3 = c(6,5,5,3,3,4,2,5,5,4)
)
I'm trying to find n number of vectors in each data frame, that have the highest correlation among each other. For this simple example, let's say want to find the n = 1 vectors in each of the k = 4 data frames, that show the overall strongest, positive correlation cor().
I'm not interested in the correlation of vectors within a data frame, but the correlation between data frames, since i wish to pick 1 variable from each set.
Intuitively, I would sum all the correlation coefficients for each combination, i.e.:
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setC$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setC$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setC$d1)))
... # and so on...
...but this seems like brute-forcing a solution that might be solvable more elegantly, with some kind of clustering-technique?
Anyhow, I was hoping to find a dynamic solution like function(n = 1, ...) where (... for data frames) which would return a list of the highest correlating vector names.
Base on your example I would not go with a really complicated algorithm unless your actual data is huge. This is a simple approach I think gets what you want.
So base on your 4 data frames a creates the list_df and then in the function I just generate all the possible combinations of variables an calculate their correlation. At the end I select the n combinations with highest correlation.
list_df = list(setA,setB,setC,setD)
CombMaxCor = function(n = 1,list_df){
column_names = lapply(list_df,colnames)
mat_comb = expand.grid(column_names)
mat_total = do.call(cbind,list_df)
vec_cor = rep(NA,nrow(mat_comb))
for(i in 1:nrow(mat_comb)){
vec_cor[i] = sum(cor(mat_total[,as.character(unlist(mat_comb[i,]))]))
}
pos_max_temp = rev(sort(vec_cor))[1:n]
pos_max = vec_cor%in%pos_max_temp
comb_max_cor = mat_comb[pos_max,]
return(comb_max_cor)
}
You could use comb function:
fun = function(x){
nm = paste0(names(x),collapse="")
if(!grepl("(.)\\d.*\\1",nm,perl = T))
setNames(sum(cor(x)),nm)
}
unlist(combn(a,4,fun,simplify = FALSE))[1:3]#Only printed the first 3
a1b1c1d1 a1b1c1d2 a1b1c1d3
3.246442 4.097532 3.566949
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
[1] 3.246442
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d2)))
[1] 4.097532
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d3)))
[1] 3.566949
Here is a function we can use to get n non-repeating columns from each data frame to get the max total correlation:
func <- function(n, ...){
list.df <- list(...)
n.df <- length(list.df)
# 1) First get the correlations
get.two.df.cors <- function(df1, df2) apply(df1, 2,
function(x) apply(df2, 2, function(y) cor(x,y))
)
cor.combns <- lapply(list.df, function(x)
lapply(list.df, function(y) get.two.df.cors(x,y))
)
# 2) Define function to help with aggregating the correlations.
# We will call them for different combinations of selected columns from each df later
# cmbns: given a df corresponding columns to be selected each data frame
# (i-th row corresponds to i-th df),
# return the "total correlation"
get.cmbn.sum <- function(cmbns, cor.combns){
# a helper matrix to help aggregation
# each row represents which two data frames we want to get the correlation sums
df.df <- t(combn(seq(n.df), 2, c))
# convert to list of selections for each df
cmbns <- split(cmbns, seq(nrow(cmbns)))
sums <- apply(df.df, 1,
function(dfs) sum(
cor.combns[[dfs[1]]][[dfs[2]]][cmbns[[dfs[2]]], cmbns[[dfs[1]]]]
)
)
# sum of the sums give the "total correlation"
sum(sums)
}
# 3) Now perform the aggragation
# get the methods of choosing n columns from each of the k data frames
if (n==1) {
cmbns.each.df <- lapply(list.df, function(df) matrix(seq(ncol(df)), ncol=1))
} else {
cmbns.each.df <- lapply(list.df, function(df) t(combn(seq(ncol(df)), n, c)))
}
# get all unique selection methods
unique.selections <- Reduce(function(all.dfs, new.df){
all.dfs.lst <- rep(list(all.dfs), nrow(new.df))
all.new.rows <- lapply(seq(nrow(new.df)), function(x) new.df[x,,drop=F])
for(i in seq(nrow(new.df))){
for(j in seq(length(all.dfs.lst[[i]]))){
all.dfs.lst[[i]][[j]] <- rbind(all.dfs.lst[[i]][[j]], all.new.rows[[i]])
}
}
do.call(c, all.dfs.lst)
}, c(list(list(matrix(numeric(0), nrow=0, ncol=n))), cmbns.each.df))
# for each unique selection method, calculate the total correlation
result <- sapply(unique.selections, get.cmbn.sum, cor.combns=cor.combns)
return( unique.selections[[which.max(result)]] )
}
And now we have:
# n = 1
func(1, setA, setB, setC, setD)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 2
# n = 2
func(2, setA, setB, setC, setD)
# [,1] [,2]
# [1,] 1 2
# [2,] 2 3
# [3,] 2 3
# [4,] 2 3
I have two data frames of same number of columns (but not rows) df1 and df2. For each row in df2, I was able to find the best (and second best) matching rows from df1 in terms of hamming distance, in my previous post. In that post, we have been using the following example data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
I now need to compute the number of bits equal to 1 for:
each row in df2
the best matching rows in df1
the second matching rows in df1
The number of bits equal to 1 of an integer a maybe computed as
sum(as.integer(intToBits(a)))
And I have applied this to #ZheyuanLi's original function, so I have got item 1>. However I'm unable to apply the same logic to get item 2> and 3>, by simple modification of #ZheyuanLi's function.
Below are the functions from #ZheyuanLi's with modification:
hmd <- function(x,y) {
rawx <- intToBits(x)
rawy <- intToBits(y)
nx <- length(rawx)
ny <- length(rawy)
if (nx == ny) {
## quick return
return (sum(as.logical(xor(rawx,rawy))))
} else if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
nc <- nx / ny ## number of cycles
return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
}
}
foo <- function(df1, df2, p = 2) {
## check p
if (p > nrow(df2)) p <- nrow(df2)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
sb <- integer(n)
k <- 1:p
for (i in 1:n) {
set.bits <- sum(as.integer(intToBits(yt[,i])))
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
sb[i] <- set.bits
k <- k + p
}
## recode "id", "d" and "sb" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
sb <- as.data.frame(matrix(sb, ncol = 1)) ## no need for byrow as you have only 1 column
colnames(sb) <- "set.bits.1"
list(id = id, d = d, sb = sb)
}
Running these gives:
> foo(df1, df2)
$id
min1 min2 ## row id for best/second best match in df1
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2 ## minimum 2 hamming distance
1 2 2
2 1 3
3 1 3
$sb
set.bits.1 ## number of bits equal to 1 for each row of df2
1 3
2 2
3 4
OK, after reading through while re-editing your question (many times!), I think I know what you want. Essentially we need change nothing to hmd(). Your required items 1>, 2>, 3> can all be computed after the for loop in foo().
To get item 1>, which you called sb, we can use a tapply(). However, your computation of sb along the for loop is fine, so I will not change it. In the following, I will demonstrate the basic procedure to get item 2> and item 3>.
The id vector inside foo() stores all matching rows in df1:
id <- c(1, 4, 2, 3, 5, 2)
so we can simply extract those rows of df1 (actually, columns of xt), to compute the number of bits equal to 1. As you can see, there are lots of duplicity in id, so we can only computes on unique(id):
id0 <- sort(unique(id))
## [1] 1 2 3 4 5
We now extract those subset columns of xt:
sub_xt <- xt[, id0]
## [,1] [,2] [,3] [,4] [,5]
## V1 9 3 10 5 6
## V2 2 4 8 7 1
To compute the number of bits equal to 1 for each column of sub_xt, we again use tapply() and vectorized approach.
rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
sbxt0 <- unname(tapply(X = rawbits,
INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
FUN = sum))
## [1] 3 3 3 5 3
Now we need to map sbxt0 to sbxt:
sbxt <- sbxt0[match(id, id0)]
## [1] 3 5 3 3 3 3
Then we can convert sbxt to a data frame sb1:
sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
## min.1.set.bits.1 min.2.set.bits.1
## 1 3 5
## 2 3 3
## 3 3 3
Finally we can assemble these things up:
foo <- function(df1, df2, p = 2) {
## check p
if (p > nrow(df2)) p <- nrow(df2)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
sb2 <- integer(n)
k <- 1:p
for (i in 1:n) {
set.bits <- sum(as.integer(intToBits(yt[,i])))
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
sb2[i] <- set.bits
k <- k + p
}
## compute "sb1"
id0 <- sort(unique(id))
sub_xt <- xt[, id0]
rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
sbxt0 <- unname(tapply(X = rawbits,
INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
FUN = sum))
sbxt <- sbxt0[match(id, id0)]
sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
## recode "id", "d" and "sb2" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
sb2 <- as.data.frame(matrix(sb2, ncol = 1)) ## no need for byrow as you have only 1 column
colnames(sb2) <- "set.bits.1"
list(id = id, d = d, sb1 = sb1, sb2 = sb2)
}
Now, running foo(df1, df2) gives:
> foo(df1,df2)
$id
min.1 min.2
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2
1 2 2
2 1 3
3 1 3
$sb1
min.1.set.bits.1 min.2.set.bits.1
1 3 5
2 3 3
3 3 3
$sb2
set.bits.1
1 3
2 2
3 4
Note that I have renamed the sb you used to sb2.
I have a matrix temp1 (dimensions Nx16) (generally, NxM)
I would like to sum every k columns in each row to one value.
Here is what I got to so far:
cbind(rowSums(temp1[,c(1:4)]), rowSums(temp1[,c(5:8)]), rowSums(temp1[,c(9:12)]), rowSums(temp1[,c(13:16)]))
There must be a more elegant (and generalized) method to do it.
I have noticed similar question here:
sum specific columns among rows
couldn't make it work with Ananda's solution;
Got following error:
sapply(split.default(temp1, 0:(length(temp1)-1) %/% 4), rowSums)
Error in FUN(X[[1L]], ...) :
'x' must be an array of at least two dimensions
Please advise.
You can use by:
do.call(cbind, by(t(temp1), (seq(ncol(temp1)) - 1) %/% 4, FUN = colSums))
If the dimensions are equal for the sub matrices, you could change the dimensions to an array and then do the rowSums
m1 <- as.matrix(temp1)
n <- 4
dim(m1) <- c(nrow(m1), ncol(m1)/n, n)
res <- matrix(rowSums(apply(m1, 2, I)), ncol=n)
identical(res[,1],rowSums(temp1[,1:4]))
#[1] TRUE
Or if the dimensions are unequal
t(sapply(seq(1,ncol(temp2), by=4), function(i) {
indx <- i:(i+3)
rowSums(temp2[indx[indx <= ncol(temp2)]])}))
data
set.seed(24)
temp1 <- as.data.frame(matrix(sample(1:20, 16*4, replace=TRUE), ncol=16))
set.seed(35)
temp2 <- as.data.frame(matrix(sample(1:20, 17*4, replace=TRUE), ncol=17))
Another possibility:
x1<-sapply(1:(ncol(temp1)/4),function(x){rowSums(temp1[,1:4+(x-1)*4])})
## check
x0<-cbind(rowSums(temp1[,c(1:4)]), rowSums(temp1[,c(5:8)]), rowSums(temp1[,c(9:12)]), rowSums(temp1[,c(13:16)]))
identical(x1,x0)
# TRUE
Here's another approach. Convert the matrix to an array and then use apply with sum.
n <- 4
apply(array(temp1, dim=c(dim(temp1)/c(1,n), n)), MARGIN=c(1,3), FUN=sum)
Using #akrun's data
set.seed(24)
temp1 <- matrix(sample(1:20, 16*4, replace=TRUE), ncol=16)
a function which sums matrix columns with each group of size n columns
set.seed(1618)
mat <- matrix(rnorm(24 * 16), 24, 16)
f <- function(mat, n = 4) {
if (ncol(mat) %% n != 0)
stop()
cols <- split(colSums(mat), rep(1:(ncol(mat) / n), each = n))
## or use this to have n mean the number of groups you want
# cols <- split(colSums(mat), rep(1:n, each = ncol(mat) / n))
sapply(cols, sum)
}
f(mat, 4)
# 1 2 3 4
# -17.287137 -1.732936 -5.762159 -4.371258
c(sum(mat[,1:4]), sum(mat[,5:8]), sum(mat[,9:12]), sum(mat[,13:16]))
# [1] -17.287137 -1.732936 -5.762159 -4.371258
More examples:
## first 8 and last 8 cols
f(mat, 8)
# 1 2
# -19.02007 -10.13342
## each group is 16 cols, ie, the entire matrix
f(mat, 16)
# 1
# -29.15349
sum(mat)
# [1] -29.15349
I've got a set of objects, let's say with the IDs 'A' to 'J'. And I've got two data frames which look the following way (as you can see, the second data frame is symmetric):
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,69,9,83,26,NA,67,95,74,69,67,NA,6,84,9,95,6,NA), ncol = 5, nrow = 5, dimnames = list(df1$ID, df1$ID)))
For example, take the objects 'B' and 'E'. I want to know: Is 13+28 (from df1) less than 9 (from df2)? I'd like to know this for all pairs of objects. The output should be
(a) a logical data frame structured like df2 and
(b) the number of "TRUE" values.
Most of the time I will only need result (b), but sometimes I would also need (a). So if (b) can be calculated without (a) and if this would be significantly faster, then I'd like to have both algorithms in order to select the suitable one dependent on which output I need to answer a particular question.
I'm comparing around 2000 objects, so the algorithm should be reasonably fast. So far I've been only able to implement this with two nested for-loops which is awfully slow. I bet there is a much nicer way to do this, maybe exploiting vectorisation.
This is what it currently looks like:
df3 <- as.data.frame(matrix(data = NA, ncol = nrow(df1), nrow = nrow(df1),
dimnames = list(df1$ID, df1$ID)))
for (i in 2:nrow(df3)){
for (j in 1:(i-1)){
sum.val <- df1[df1$ID == rownames(df3)[i], "Var"] + df1[df1$ID == names(df3)[j], "Var"]
df3[i,j] <- sum.val <= df2[i,j]
}
}
#
Is this what you want?
df3 <- outer(df1$Var, df1$Var, "+")
df3
df4 <- df3 < df2
df4
sum(df4, na.rm = TRUE)
Here's one way to do it...
# Get row and column indices
ind <- t( combn( df1$ID , 2 ) )
# Get totals
tot <- with( df1 , Var[ match( ind[,1] , ID ) ] + Var[ match( ind[,2] , ID ) ] )
# Make df2 a matrix
m <- as.matrix( df2 )
# Total number of values is simply
sum( m[ ind ] > tot )
#[1] 7
# Find which values in upper triangle part of the matrix exceed those from df1 (1 = TRUE)
m[upper.tri(m)] <- m[ ind ] > tot
# A B C D E
#A NA 1 1 1 0
#B 42 NA 1 0 1
#C 83 26 NA 1 1
#D 74 69 67 NA 0
#E 84 9 95 6 NA
This will do what you want.
# Generate the data
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,
69,9,83,26,NA,67,95,74,69,
67,NA,6,84,9,95,6,NA),
ncol = 5, nrow = 5,
dimnames = list(df1$ID, df1$ID)))
# Define a pairwise comparison index matrix using 'combn'
idx <- combn(nrow(df1), 2)
# Create a results matrix
res <- matrix(NA, ncol = ncol(df2), nrow = nrow(df2))
# Loop through 'idx' for each possible comparison (without repeats)
for(i in 1:ncol(idx)){
logiTest <- (df1$Var[idx[1,i]] + df1$Var[idx[2,i]]) < df2[idx[1,i], idx[2,i]]
res[idx[1,i], idx[2, i]] <- logiTest
res[idx[2,i], idx[1, i]] <- logiTest
}
# Count the number of 'true' comparisons
nTrues <- sum(res, na.rm = TRUE)/2
The code simply uses a pairwise comparison index (idx) to define which elements in both df1 and df2 are to be used in each iteration of the 'for loop'. It then uses this same index to define where in the 'res' matrix the answer to the logical test is to be written.
N.B. This code will break down if the order of elements in df1 and df2 are not the same. In such cases, it would be appropriate to use the actual letters to define which values to compare.