Related
Taking into account your past answers, I've changed my code to the following:
n <- 100
B <- 20
S <- 50
alpha <- 0.3
beta <- 1.2
theta <- alpha*beta
for (i in 1:S) {
###
sim_original_samples <- rgamma(n, alpha, beta) # for each S, we have a sample of 100 observations
sim_original_samples_X_bar <- mean(sim_original_samples) # for each dataset, compute the sample mean and input it
sim_bs_samples_X_bar <- matrix(0,B,1)
# in the same loop we are going to compute the sample mean per bootstrap per original sample i
####
####
for (j in 1:B) {
sim_bs_samples <- sample(sim_original_samples,n,replace=TRUE)
# for each original sample, we are going to draw B times a bootstrap sample
sim_bs_samples_X_bar[j] <- mean(sim_bs_samples)
# all the elements of this matrix should be the bootstrap sample mean
var_sim_bs_samples <- matrix(0,B,1)
var_sim_bs_samples[j] <- (sim_bs_samples_X_bar[j] - sim_original_samples_X_bar)^2
se_sim_bs_samples <- sqrt((1/B*sum(var_sim_bs_samples)))
}
####
####
# now we want to compute the asymptotic CI of i)
z <- 1.96
var_gamma <- alpha*beta^2/n
CI_sim_asy_norm <- matrix(ncol = 3, nrow = S) # create a vector for the CI
names <- c("Lower bound", "Upper bound", "theta covered")
colnames(CI_sim_asy_norm) <- names
#
CI_sim_asy_norm[i,1] <- theta - z*sqrt(var_gamma)
CI_sim_asy_norm[i,2] <- theta + z*sqrt(var_gamma)
CI_sim_asy_norm[i,3] <- theta >= CI_sim_asy_norm[i,1] & theta <= CI_sim_asy_norm[i,2]
# check whether the true parameter of interest is covered
####
####
# do the same for the asymptotic BS CI of ii)
CI_sim_asy_bs <- matrix(ncol = 3, nrow = S)
colnames(CI_sim_asy_bs) <- names
CI_sim_asy_bs[i,1] <- sim_original_samples_X_bar - z*se_sim_bs_samples
CI_sim_asy_bs[i,2] <- sim_original_samples_X_bar + z*se_sim_bs_samples
CI_sim_asy_bs[i,3] <- theta >= CI_sim_asy_bs[i,1] & theta <= CI_sim_asy_bs[i,2]
####
####
# do the same for the percentile BS CI of iii) assuming B = 1000 for simplicity
sim_bs_samples_X_bar_sorted <- sort(sim_bs_samples_X_bar, decreasing=FALSE)
CI_sim_percentile <- matrix(ncol = 3, nrow = S)
colnames(CI_sim_percentile) <- names
CI_sim_percentile[i,1] <- sim_bs_samples_X_bar_sorted[1000*(0.05/2)]
CI_sim_percentile[i,2] <- sim_bs_samples_X_bar_sorted[1000*((1-0.05)/2)]
CI_sim_percentile[i,3] <- theta >= CI_sim_percentile[i,1] & theta <= CI_sim_percentile[i,2]
####
}
The issue I have now is that only the last row of the CI matrix is filled (when it is filled at all), whereas every row should be filled.
Where is the issue? I cannot see it.
That is, for each original sample i, I draw B bootstrap samples.
For each original sample i, I want to construct confidence intervals.
For each confidence interval, I want to know whether the true parameter (theta) is contained in it.
Hence, I'd have 50 confidence intervals.
For the bootstrap ones, each is based on the estimates from the 20 bootstrap replications (per original sample).
Many thanks
The Question
Let's divide your question into two parts:
How to create the data: samples, re-samples, means, etc.
How to create a multi-dimensional object
1. Creating the data
Configuration from your question
S <- 5
n <- 100
B <- 2
alpha <- 0.3
beta <- 1.2
Sample and re-sample
require( tidyverse )
U <- map_dfc( 1:S, ~rgamma( n, alpha, beta ))
Ubar <- map_dfc( 1:S, ~mean( U[[ . ]] ))
V <- map_dfc( 1:S, ~sample( U[[ . ]], n, replace = TRUE ))
Vbar <- map_dfc( 1:S, ~mean( V[[ . ]] ))
What shape are these?
dim( U )
# 100 5
dim( V )
# 100 5
dim( Ubar )
# 1 5
dim( Vbar )
# 1 5
Now, what did you want to stack? (and why?)
2. How to create a multi-dimensional object
Sometimes it can be helpful to pack data into a multi-dimensional object, in order to facilitate slicing along axes, or to select specific elements.
Define the object
multi_dimensional <- array(
data = 0:( S * B * n - 1 )
, dim = c( S, B, n )
, dimnames = list( # <---- names are optional
paste0( 'X', 1:S )
, paste0( 'Y', 1:B )
, paste0( 'Z', 1:n )
)
)
dim( multi_dimensional )
# [1] 5 2 100
Slice, dice, and chop
multi_dimensional[ 1, 1, 1 ]
# [1] 0
multi_dimensional[ S, B, n ]
# [1] 999
multi_dimensional[ 1, 2, 1:10]
# Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 Z9 Z10
# 5 15 25 35 45 55 65 75 85 95
multi_dimensional[ , , 1:2 ]
# , , Z1
#
# Y1 Y2
# X1 0 5
# X2 1 6
# X3 2 7
# X4 3 8
# X5 4 9
#
# , , Z2
#
# Y1 Y2
# X1 10 15
# X2 11 16
# X3 12 17
# X4 13 18
# X5 14 19
multi_dimensional[ S, B, 99:100 ]
# Z99 Z100
# 989 999
Use the named dimensions, if you wish
multi_dimensional[ , , c( 'Z1', 'Z2' ) ]
# , , Z1
#
# Y1 Y2
# X1 0 5
# X2 1 6
# X3 2 7
# X4 3 8
# X5 4 9
#
# , , Z2
#
# Y1 Y2
# X1 10 15
# X2 11 16
# X3 12 17
# X4 13 18
# X5 14 19
multi_dimensional[ c( 'X1', 'X3', 'X5' ), 'Y2' , c( 'Z1', 'Z2' ) ]
# Z1 Z2
# X1 5 15
# X3 7 17
# X5 9 19
Assign new values to specific elements
multi_dimensional[ 5, 1, 29:30 ] <- c( 124.76, -5.0002 )
Now show the new values
multi_dimensional[ 5, 1, 29:30 ]
# Z29 Z30
# 124.8 -5.0
multi_dimensional[ 1:3, , 91:100 ] # slice off a particular 3 x 10 block
# (not shown, due to size)
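If, for instance, you wanted to keep every bootstrap sample mean from the simulation in your question, a minimal sketch (assuming the same n, B, S, alpha and beta as above, and the hypothetical name boot_means) would be to fill an S x B array inside the two loops:
# boot_means[i, j] holds the bootstrap mean of replicate j for original sample i
boot_means <- array(NA_real_, dim = c(S, B))
for (i in 1:S) {
  sim_original_samples <- rgamma(n, alpha, beta)
  for (j in 1:B) {
    boot_means[i, j] <- mean(sample(sim_original_samples, n, replace = TRUE))
  }
}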
Initial Answer
I think you misunderstood the purpose of the matrix datatype: in R, matrices can't store complex objects such as other matrices; they are limited to double/numeric, integer, logical, character, complex, and raw.
Since you know the sizes of most of those data structures, you should declare them beforehand AND outside the loops.
What you seem to want is to store a list of B matrices of arbitrary size (1 by n) that are generated in the inner loop. You can declare an empty list and add the matrices to it in the inner loop with something like this:
#You should declare this outside the loops.
matrix_j <- vector(mode='list', length=B)
#Then on the inner loop you can use [[]] to add elements to a list
for (j in 1:B) {
matrix_j[[j]] <- sample(sim_original_samples, n, replace=TRUE)
}
or if you want an empty list of size 0, you can do matrix_j <- list() instead.
Next, I wasn't sure whether you want to compute the mean of each sample inside the list or the mean of the whole set of numbers, so:
The first option would require the list-apply function lapply, like this: lapply(matrix_j, mean), which returns a list in which each element is the mean of the element in the same position of matrix_j.
For the second possibility, I think it would be more appropriate to combine the list elements into one simpler data structure and then compute the mean.
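For instance, a minimal sketch of that second option (assuming matrix_j is the list of B bootstrap samples filled in the loop above):
# flatten the list of bootstrap samples into one numeric vector, then take the overall mean
all_draws <- unlist(matrix_j)
mean(all_draws)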
For your last problem, it seems to me that using lists (lists of lists) would solve your issue.
I'd create a big empty list and then add other lists as elements, as lists are allowed to contain other lists.
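A minimal sketch of that nesting (the names are only illustrative):
# one outer element per original sample i; each holds the B bootstrap samples for that sample
results <- vector(mode = 'list', length = S)
for (i in 1:S) {
  results[[i]] <- vector(mode = 'list', length = B)
}
# later, inside the loops: results[[i]][[j]] <- sample(sim_original_samples, n, replace = TRUE)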
Answer to edits:
You are redeclaring lots of matrices inside loops. This is bad practice: every time you do this, you reset them to their initial values, so avoid it if you want to keep data from previous iterations.
Consider this part of your code:
for (j in 1:B) {
sim_bs_samples <- sample(sim_original_samples,n,replace=TRUE)
# for each original sample, we are going to draw B times a bootstrap sample
sim_bs_samples_X_bar[j] <- mean(sim_bs_samples)
# all the elements of this matrix should be the bootstrap sample mean
var_sim_bs_samples <- matrix(0,B,1)
var_sim_bs_samples[j] <- (sim_bs_samples_X_bar[j] - sim_original_samples_X_bar)^2
se_sim_bs_samples <- sqrt((1/B*sum(var_sim_bs_samples)))
}
Every time the var_sim_bs_samples <- matrix(0,B,1) line runs inside the loop, it replaces the current matrix with a new matrix full of zeros, and then the following line assigns something to its j-th entry.
To avoid this behavior, the declaration shouldn't happen inside the loop, which is why I told you to either create a list and store each new matrix inside it OR move the declaration outside the loop and keep filling in its entries. So, to fix that, you could move the declaration outside the loop, as I've done here:
var_sim_bs_samples <- matrix(0,B,1)
for (j in 1:B) {
sim_bs_samples <- sample(sim_original_samples,n,replace=TRUE)
# for each original sample, we are going to draw B times a bootstrap sample
sim_bs_samples_X_bar[j] <- mean(sim_bs_samples)
# all the elements of this matrix should be the bootstrap sample mean
var_sim_bs_samples[j] <- (sim_bs_samples_X_bar[j] - sim_original_samples_X_bar)^2
se_sim_bs_samples <- sqrt((1/B*sum(var_sim_bs_samples)))
}
The reason only the last row of several matrices is filled is that you're redeclaring each of them (i.e. erasing it) as an empty matrix every time with <- matrix(ncol = 3, nrow = S). You empty the matrix and then assign something to its i-th row; the last iteration happens to write to the last row, since i goes from 1 to S.
Second Edit:
Just move the declarations outside of the loops where you're using them, like this:
n <- 100
B <- 20
S <- 50
alpha <- 0.3
beta <- 1.2
theta <- alpha*beta
CI_sim_asy_norm <- matrix(ncol = 3, nrow = S)
CI_sim_asy_bs <- matrix(ncol = 3, nrow = S)
CI_sim_percentile <- matrix(ncol = 3, nrow = S)
for (i in 1:S) {
sim_original_samples <- rgamma(n, alpha, beta)
sim_original_samples_X_bar <- mean(sim_original_samples)
sim_bs_samples_X_bar <- matrix(0,B,1)
var_sim_bs_samples <- matrix(0,B,1)
for (j in 1:B) {
sim_bs_samples <- sample(sim_original_samples,n,replace=TRUE)
Your CI_sim_percentile will still be full of NAs, because sim_bs_samples_X_bar_sorted[1000*(0.05/2)] tries to access index 25 of sim_bs_samples_X_bar_sorted, which only has B = 20 elements, and R returns NA for out-of-range indices.
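One way around this (a sketch, not the only option) is to take the percentile bounds from the B bootstrap means directly with quantile(), which also interpolates for a small B such as 20:
# percentile bootstrap CI from the B bootstrap means of original sample i
CI_sim_percentile[i,1] <- quantile(sim_bs_samples_X_bar, probs = 0.025)
CI_sim_percentile[i,2] <- quantile(sim_bs_samples_X_bar, probs = 0.975)
CI_sim_percentile[i,3] <- theta >= CI_sim_percentile[i,1] & theta <= CI_sim_percentile[i,2]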
I am running a Monte Carlo simulation of a multinomial logit, so I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values, in particular varying both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far I have managed to solve it, but at some point I ended up with a nested for-loop structure over a grid search of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual (n.choices_list). I am quite worried about the efficiency of my last bit of code, with its double for-loop running over the combinations of possible values. Probably there is a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I ended up with multiple objects containing the models from the grid-search combinations, with informative names, but it would also be great if I could collapse all of them into one list; with the current structure, I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}
You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
result <- map2(v1, v2, function(i, j){
  # mlogit_sim_data() reads n.indiv and n.choices from the global environment, so set them per combination
  n.indiv <<- i
  n.choices <<- j
  lapply(X = 1:nreps, FUN = mlogit_sim_data)
})
result will be a list of your function outputs.
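If you also want the informative names you were building with assign(), you can attach them to the list instead of creating many separate objects (a small sketch, reusing v1 and v2 from above):
# name each element after its grid combination, e.g. result[["m_ind_15_choices_4"]]
names(result) <- paste0("m_ind_", v1, "_choices_", v2)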
Let's assume four data frames, each with 3 vectors, e.g.
setA <- data.frame(
a1 = c(6,5,2,4,5,3,4,4,5,3),
a2 = c(4,3,1,4,5,1,1,6,3,2),
a3 = c(5,4,5,6,4,6,5,5,3,3)
)
setB <- data.frame(
b1 = c(5,3,4,3,3,6,4,4,3,5),
b2 = c(4,3,1,3,5,2,5,2,5,6),
b3 = c(6,5,4,3,2,6,4,3,4,6)
)
setC <- data.frame(
c1 = c(4,4,5,5,6,4,2,2,4,6),
c2 = c(3,3,4,4,2,1,2,3,5,4),
c3 = c(4,5,4,3,5,5,3,5,5,6)
)
setD <- data.frame(
d1 = c(5,5,4,4,3,5,3,5,5,4),
d2 = c(4,4,3,3,4,3,4,3,4,5),
d3 = c(6,5,5,3,3,4,2,5,5,4)
)
I'm trying to find the n vectors in each data frame that have the highest correlation with each other. For this simple example, let's say we want to find the n = 1 vector in each of the k = 4 data frames that shows the overall strongest positive correlation cor().
I'm not interested in the correlation of vectors within a data frame, but in the correlation between data frames, since I wish to pick 1 variable from each set.
Intuitively, I would sum all the correlation coefficients for each combination, i.e.:
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setD$d1)))
... # and so on...
...but this seems like brute-forcing a solution that might be solvable more elegantly, with some kind of clustering-technique?
Anyhow, I was hoping to find a dynamic solution like function(n = 1, ...) (where ... stands for the data frames) which would return a list of the names of the highest-correlating vectors.
Based on your example, I would not go with a really complicated algorithm unless your actual data is huge. This is a simple approach that I think gets what you want.
So, based on your 4 data frames, I create list_df, and then in the function I just generate all the possible combinations of variables and calculate their correlations. At the end, I select the n combinations with the highest correlation.
list_df = list(setA,setB,setC,setD)
CombMaxCor = function(n = 1,list_df){
column_names = lapply(list_df,colnames)
mat_comb = expand.grid(column_names)
mat_total = do.call(cbind,list_df)
vec_cor = rep(NA,nrow(mat_comb))
for(i in 1:nrow(mat_comb)){
vec_cor[i] = sum(cor(mat_total[,as.character(unlist(mat_comb[i,]))]))
}
pos_max_temp = rev(sort(vec_cor))[1:n]
pos_max = vec_cor%in%pos_max_temp
comb_max_cor = mat_comb[pos_max,]
return(comb_max_cor)
}
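A quick usage sketch with the four example sets (the returned rows are the combinations of column names, one per data frame, with the largest summed correlation):
CombMaxCor(n = 1, list_df)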
You could use the combn function, after combining all four sets into one data frame:
a <- cbind(setA, setB, setC, setD)
fun = function(x){
nm = paste0(names(x),collapse="")
if(!grepl("(.)\\d.*\\1",nm,perl = T))
setNames(sum(cor(x)),nm)
}
unlist(combn(a,4,fun,simplify = FALSE))[1:3]#Only printed the first 3
a1b1c1d1 a1b1c1d2 a1b1c1d3
3.246442 4.097532 3.566949
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
[1] 3.246442
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d2)))
[1] 4.097532
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d3)))
[1] 3.566949
Here is a function we can use to pick n non-repeating columns from each data frame so that the total correlation is maximized:
func <- function(n, ...){
list.df <- list(...)
n.df <- length(list.df)
# 1) First get the correlations
get.two.df.cors <- function(df1, df2) apply(df1, 2,
function(x) apply(df2, 2, function(y) cor(x,y))
)
cor.combns <- lapply(list.df, function(x)
lapply(list.df, function(y) get.two.df.cors(x,y))
)
# 2) Define a function to help with aggregating the correlations.
# We will call it for different combinations of selected columns from each df later.
# cmbns: a matrix whose i-th row gives the columns to be selected from the i-th df;
# returns the "total correlation"
get.cmbn.sum <- function(cmbns, cor.combns){
# a helper matrix to help aggregation
# each row represents which two data frames we want to get the correlation sums
df.df <- t(combn(seq(n.df), 2, c))
# convert to list of selections for each df
cmbns <- split(cmbns, seq(nrow(cmbns)))
sums <- apply(df.df, 1,
function(dfs) sum(
cor.combns[[dfs[1]]][[dfs[2]]][cmbns[[dfs[2]]], cmbns[[dfs[1]]]]
)
)
# sum of the sums give the "total correlation"
sum(sums)
}
# 3) Now perform the aggregation
# get the methods of choosing n columns from each of the k data frames
if (n==1) {
cmbns.each.df <- lapply(list.df, function(df) matrix(seq(ncol(df)), ncol=1))
} else {
cmbns.each.df <- lapply(list.df, function(df) t(combn(seq(ncol(df)), n, c)))
}
# get all unique selection methods
unique.selections <- Reduce(function(all.dfs, new.df){
all.dfs.lst <- rep(list(all.dfs), nrow(new.df))
all.new.rows <- lapply(seq(nrow(new.df)), function(x) new.df[x,,drop=F])
for(i in seq(nrow(new.df))){
for(j in seq(length(all.dfs.lst[[i]]))){
all.dfs.lst[[i]][[j]] <- rbind(all.dfs.lst[[i]][[j]], all.new.rows[[i]])
}
}
do.call(c, all.dfs.lst)
}, c(list(list(matrix(numeric(0), nrow=0, ncol=n))), cmbns.each.df))
# for each unique selection method, calculate the total correlation
result <- sapply(unique.selections, get.cmbn.sum, cor.combns=cor.combns)
return( unique.selections[[which.max(result)]] )
}
And now we have:
# n = 1
func(1, setA, setB, setC, setD)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 2
# n = 2
func(2, setA, setB, setC, setD)
# [,1] [,2]
# [1,] 1 2
# [2,] 2 3
# [3,] 2 3
# [4,] 2 3
Consider the following table df, with categorical variables denoted x1 and x2 and numerical measurements denoted y1, y2 and y3:
df <- data.frame(x1=sample(letters[1:3], 20, replace=TRUE),
x2=sample(letters[4:6], 20, replace=TRUE),
y1=rnorm(20), y2=rnorm(20), y3=rnorm(20))
I'd like to apply to it a function of the 3 numerical measurements y, grouped by the categorical variables x. For example, the following function, whose input y is a table of 3 columns and which should output one new column:
f <- function(y){ sum((y[,1] - y[,2]) / y[,3]) }
I tried it with aggregate, dplyr, summarizeBy... without success, as it seems that for every method, mixing the input columns is not an option. Any idea how to do that with this kind of function (i.e. taking advantage of aggregation)?
aggregate(data = df, y1 + y2 + y3 ~ x1 + x2, FUN = f)
To clarify, the expected result can be obtained with something like:
groups <- unique(df[,c("x1", "x2")]) # coocurences of explanatory variables
res <- c()
for (i in 1:nrow(groups)){ # get the subtables
temp <- df[df$x1 == groups[i,1] & df$x2 == groups[i,2], c("y1", "y2", "y3")]
res <- c(res, f(temp)) # apply function on subtables
}
groups$res <- res # aggregate results
This is not that bad for this simple toy example, but it becomes very impractical with more complex data.
The problem is on the input side of your function. The way you specified it, it expects a data frame.
A possible solution is to feed the function a list of columns, with a small change to your function:
f <- function(y) sum((y[[1]] - y[[2]]) / y[[3]])
You can now use it in a dplyr-chain:
df %>%
group_by(x1, x2) %>%
summarise(sum_y = f(list(y1, y2, y3)))
which gives:
# A tibble: 9 x 3
# Groups: x1 [?]
x1 x2 sum_y
<fct> <fct> <dbl>
1 a d 1.20
2 a e 0.457
3 a f -9.46
4 b d -1.11
5 b e -0.176
6 b f -1.34
7 c d -0.994
8 c e 3.38
9 c f -2.63
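If you are on dplyr >= 1.1.0, a similar result can be obtained without building a list, by handing the group's columns to f as a data frame via pick(); a sketch, assuming the modified f above that indexes with [[:
df %>%
  group_by(x1, x2) %>%
  summarise(sum_y = f(pick(y1, y2, y3)), .groups = "drop")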
I have some large data, which partly consists of very similar variables. Some variables have missing values (e.g. x3 and x5 in the example below) and some variables are similar, but with different labels (e.g. x2 and x5). In order to clean my data, I want to identify and eventually delete these similar variables. I am trying to write a function, which returns the column names of all similar variable pairs. Here is some exemplifying data:
# Example data
set.seed(222)
N <- 100
x1 <- round(rnorm(N, 0, 10))
x2 <- round(rnorm(N, 10, 20))
x3 <- x1
x3[sample(1:N, 7)] <- NA
x4 <- x1
x4[sample(1:N, 5)] <- round(rnorm(5, 0, 10))
x5 <- x2
x5 <- paste("A", x5, sep = "")
x5[sample(1:N, 15)] <- NA
df <- data.frame(x1, x2, x3, x4, x5)
df$x1 <- as.character(df$x1)
df$x2 <- as.character(df$x2)
df$x3 <- as.character(df$x3)
df$x4 <- as.character(df$x4)
df$x5 <- as.character(df$x5)
head(df)
As you can see, x1, x3, and x4 are very similar; and x2 and x5 are very similar as well. My function should print a list, which includes all pairs with the same values in 80% or more of the cases. Here is what I got so far:
# My attempt to write such a function
fun_clean <- function(data, similarity) {
output <- list()
data <- data[complete.cases(data), ]
for(i in 1:ncol(data)) {
if(i < ncol(data)) {
for(j in (i + 1):ncol(data)) {
similarity_ij <- sum(data[ , i] == data[ , j]) / nrow(data)
if(similarity_ij >= similarity) {
output[[length(output) + 1]] <- colnames(data)[c(i, j)]
}
}
}
}
output
}
fun_clean(data = df, similarity = 0.8)
I managed to identify the similarity of x1, x3, and x4. The similarity of x2 and x5 (i.e. similar variables with different labels) is not found by my function. Further, my function is very slow. Therefore, I have the following question:
Question: How could I identify all similar variables in a computationally efficient way?
In order to compare your columns, you need numeric values first. You can strip the non-numeric characters with gsub() and then convert the result to numeric. After this transformation, you'll be good to go:
df <- apply(df, 2, function(x) as.numeric( gsub("[^0-9-]", "", x) )) # keep digits and the minus sign
Now you can compare all columns by first using combn(5, 2) to get all pairs of columns you want to compare. Then you can use that to compare the columns and calculate the percentage of entries that are equal.
combs <- combn(ncol(df), 2)
res <- apply(combs, 2, function(x){
sum(df[, x[1]] == df[, x[2]], na.rm = TRUE)/nrow(df)
})
thresh <- 0.8
combs[, res > thresh]
# [,1] [,2] [,3] [,4]
# [1,] 1 1 2 3
# [2,] 3 4 5 4
So columns 1 & 3, 1 & 4, 2 & 5 and 3 & 4 are equal to each other in more than 80% of the cases.
Note: If one or both of the compared values have an NA, this will be considered as not a match!
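If you would rather judge similarity only on rows where both columns are observed, a small variation on the same idea (a sketch, reusing combs, df and thresh from above):
# share of equal entries among pairwise-complete rows
res_complete <- apply(combs, 2, function(x){
  ok <- !is.na(df[, x[1]]) & !is.na(df[, x[2]])
  sum(df[ok, x[1]] == df[ok, x[2]]) / sum(ok)
})
combs[, res_complete > thresh]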
In the caret package there is functionality to discover correlated variables and also variables that are linear combinations of others:
http://topepo.github.io/caret/pre-processing.html
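For reference, a minimal sketch of those helpers on the numeric version of df built above (findCorrelation() suggests columns to drop above a correlation cutoff; findLinearCombos() needs complete rows):
library(caret)
# columns that are highly correlated with others (the cutoff is an example value)
findCorrelation(cor(df, use = "pairwise.complete.obs"), cutoff = 0.9)
# columns that are exact linear combinations of other columns
findLinearCombos(df[complete.cases(df), ])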