Calculate cohens d for all pairs of groups in dataframe - r

Using the R package "effsize", I am trying to calculate Cohen's d between all pairs of groups in my data and output all the pairwise d estimates as a matrix. I have provided some test data to illustrate this. I want a matrix of d estimates for all pairs of groups 1, 2, and 3.
I am struggling to find where to start. I know it could be done with loops, but since my real data contains 1000 groups, each with 6000 data points, I think this would be slow.
library("effsize")
test <- data.frame(
score=c(2,3,42,1,2,3,4,5,5,6,8,2),
group=c(1,1,1,1,2,2,2,2,3,3,3,3)
)
This would be similar to what pairwise.wilcox.test() provides for the Wilcoxon rank sum test.

All you have to do is to note that function combn outputs the combinations of n elements taken k at a time and can also apply a function to each resulting combination. In this case the question asks for combinations of 2 groups at a time and the function fun is applied to each one.
fun <- function(x) {
cohen.d(x[[1]]$score, x[[2]]$score)
}
sp <- split(test, test$group)
cmb <- combn(sp, 2, fun)
cmb[, 1]
#[[1]]
#[1] "Cohen's d"
#
#[[2]]
#[1] "d"
#
#[[3]]
#[1] 0.5992954
#
#[[4]]
# lower upper
#-1.169345 2.367936
#
#[[5]]
#[1] 0.95
#
#[[6]]
#[1] medium
#Levels: negligible < small < medium < large
The code above can be written as a function that does all the work and returns a matrix.
cohen.d.pairwise.test <- function(DF, scoreCol, groupCol){
fun <- function(x) {
eff <- cohen.d(x[[1]][[scoreCol]], x[[2]][[scoreCol]])
c(eff[["estimate"]],
eff[["conf.int"]][1],
eff[["conf.int"]][2],
eff[["conf.level"]])
}
sp <- split(DF, DF[[groupCol]])
cmb <- combn(sp, 2, fun)
rownames(cmb) <- c("estimate", "lower", "upper", "conf.level")
t(cmb)
}
cohen.d.pairwise.test(test, scoreCol = "score", groupCol = "group")
# estimate lower upper conf.level
#[1,] 0.5992954 -1.169345 2.3679357 0.95
#[2,] 0.4732232 -1.281054 2.2275008 0.95
#[3,] -0.8795932 -2.691556 0.9323698 0.95

Related

Rolling correlations across multiple columns, some with NAs?

I have the dataset below, and I am trying to compute rolling 3-day correlations across x, y, z, and a. So the code should compute rolling correlations for xy, xz, xa, yx, yz, ya, and so on. Also, as you can see below, the data for y and a are incomplete, but I would like to compute their rolling correlations starting from the date where they first have values (i.e. id 3 and id 4).
How should I accomplish this? I don't know where to start...
set.seed(42)
n <- 10
dat <- data.frame(id=1:n,
date=seq.Date(as.Date("2020-12-22"), as.Date("2020-12-31"), "day"),
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
a=rnorm(n))
dat$y[1:2] <- NA
dat$a[1:3] <- NA
I was able to find this code on Stack Overflow, but it only gives the correlations for the first column, not for all the columns:
rollapplyr(x, 5, function(x) cor(x[, 1], x[, -1]), by.column = FALSE)
Create a data frame with only the columns wanted and then use rollapplyr with cor. cor takes a use= argument that specifies how missing values are to be handled. See ?cor for the values it can take since you may or may not wish to use the value we used below.
The result r is a matrix whose i-th row describes the correlation matrix of the 5 dat2 rows ending in and including row i. That is, matrix(r[i, ], 4, 4) is the correlation matrix of dat2[i-(4:0), ].
We can also create ar, a 3d array such that ar[i,,] is the correlation matrix of the 5 rows of dat2 ending in and including row i.
That is, the following are equal for each i in 5, ..., nrow(dat2). (The first 4 rows of r are all NA, since fewer than 5 rows end at those rows.)
1. cor(dat2[i-(4:0), ], use = "pairwise")
2. matrix(r[i, ], 4, 4)
3. ar[i,,]
We run checks for these equivalences for i=5 below.
library(zoo)
w <- 5
dat2 <- dat[c("x", "y", "z", "a")]
nr <- nrow(dat2)
nc <- ncol(dat2)
r <- rollapplyr(dat2, w, cor, use = "pairwise", by.column = FALSE, fill = NA)
colnames(r) <- paste(names(dat2)[c(row(diag(nc)))],
names(dat2)[c(col(diag(nc)))], sep = ".")
ar <- array(r, c(nr, nc, nc),
dimnames = list(NULL, names(dat2), names(dat2)))
# run some checks
cor5 <- cor(dat2[1:w, ], use = "pairwise") # cor of 1st w rows
# same except for names
all.equal(unname(cor5), matrix(r[w, ], nc))
## [1] TRUE
all.equal(cor5, ar[w,,])
## [1] TRUE
The above shows a matrix whose rows are strung out correlation matrices and a 3d array whose slices are correlation matrices. Another possibility for output is to create a list of correlation matrices.
lapply(1:nr, function(i) {
if (i >= w) cor(dat2[i-((w-1):0), ], use = "pairwise")
})
combn produces all the combinations.
cols <- c("x", "y", "z", "a")
combn(cols, 2)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "x" "x" "x" "y" "y" "z"
# [2,] "y" "z" "a" "z" "a" "a"
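combn also accepts a function argument, which is applied to each pair of columns. Inside that function, first drop all rows containing NAs with na.omit. Then use mapply to slide a window of w rows over the remaining rows (row indices 1:w shifted by 0, 1, ..., n - w) and compute the correlation in each window.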
w <- 3 ## size of the rolling window
combn(dat[cols], 2, function(x) {
X <- na.omit(x)
n <- nrow(X)
mapply(function(y, z) cor(X[y + z, 1], X[y + z, 2]), list(1:w), 0:(n - w))
}, simplify=FALSE)
# [[1]]
# [1] 0.5307784 -0.9874843 -0.8364802 0.2407730 0.3655328 -0.4458231
#
# [[2]]
# [1] 0.8121466 0.9652715 0.3304100 0.8278965 -0.1425097 0.5832558 0.9959705
# [8] 0.8696023
#
# [[3]]
# [1] 0.6733985 0.2194488 0.5593983 -0.6589249 -0.9291184
#
# [[4]]
# [1] 0.97528684 -0.90599558 -0.42319742 0.92882443 0.28058418 0.05427966
#
# [[5]]
# [1] -0.7815678 -0.7182037 -0.6698260 0.4592962 0.7452225
#
# [[6]]
# [1] 0.9721521 0.9343926 -0.3470329 -0.7237291 -0.6253825

Find combination of n vectors across k dataframes with highest correlation

Let's assume four data frames, each with 3 vectors, e.g.
setA <- data.frame(
a1 = c(6,5,2,4,5,3,4,4,5,3),
a2 = c(4,3,1,4,5,1,1,6,3,2),
a3 = c(5,4,5,6,4,6,5,5,3,3)
)
setB <- data.frame(
b1 = c(5,3,4,3,3,6,4,4,3,5),
b2 = c(4,3,1,3,5,2,5,2,5,6),
b3 = c(6,5,4,3,2,6,4,3,4,6)
)
setC <- data.frame(
c1 = c(4,4,5,5,6,4,2,2,4,6),
c2 = c(3,3,4,4,2,1,2,3,5,4),
c3 = c(4,5,4,3,5,5,3,5,5,6)
)
setD <- data.frame(
d1 = c(5,5,4,4,3,5,3,5,5,4),
d2 = c(4,4,3,3,4,3,4,3,4,5),
d3 = c(6,5,5,3,3,4,2,5,5,4)
)
I'm trying to find n vectors in each data frame that have the highest correlation with each other. For this simple example, let's say I want to find the n = 1 vector in each of the k = 4 data frames that together show the overall strongest positive correlation according to cor().
I'm not interested in the correlation of vectors within a data frame, only in the correlation between data frames, since I wish to pick 1 variable from each set.
Intuitively, I would sum all the correlation coefficients for each combination, i.e.:
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setD$d1)))
... # and so on...
...but this seems like brute-forcing a solution that might be solvable more elegantly, with some kind of clustering-technique?
Anyhow, I was hoping to find a dynamic solution like function(n = 1, ...) (where ... stands for the data frames) which would return a list of the names of the highest-correlating vectors.
Based on your example, I would not go with a really complicated algorithm unless your actual data is huge. This is a simple approach that I think gets what you want.
Based on your 4 data frames, I create list_df, and then in the function I generate all possible combinations of variables and calculate the sum of their correlations. At the end I select the n combinations with the highest correlation.
list_df = list(setA,setB,setC,setD)
CombMaxCor = function(n = 1,list_df){
column_names = lapply(list_df,colnames)
mat_comb = expand.grid(column_names)
mat_total = do.call(cbind,list_df)
vec_cor = rep(NA,nrow(mat_comb))
for(i in 1:nrow(mat_comb)){
vec_cor[i] = sum(cor(mat_total[,as.character(unlist(mat_comb[i,]))]))
}
pos_max_temp = rev(sort(vec_cor))[1:n]
pos_max = vec_cor%in%pos_max_temp
comb_max_cor = mat_comb[pos_max,]
return(comb_max_cor)
}
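One possible refinement of this brute-force idea, sketched below under stated assumptions: the correlation matrix of all columns can be computed once, and each combination is then scored by summing the matching sub-block, instead of calling cor() inside the loop. The toy data and the names sets, C, grid, and score are made up for illustration.

```r
# Hypothetical toy data: three sets with two columns each.
set.seed(1)
sets <- list(
  data.frame(a1 = rnorm(10), a2 = rnorm(10)),
  data.frame(b1 = rnorm(10), b2 = rnorm(10)),
  data.frame(c1 = rnorm(10), c2 = rnorm(10))
)

all_cols <- do.call(cbind, sets)
C <- cor(all_cols)                       # computed once for all columns

# one row per way of picking one column from each set
grid <- expand.grid(lapply(sets, colnames), stringsAsFactors = FALSE)

# score of a combination = sum of the matching sub-block of C
score <- apply(grid, 1, function(cols) sum(C[cols, cols]))

best <- grid[which.max(score), ]
best
```

Note that sum(cor(...)) includes the diagonal (always equal to the number of selected columns) and counts every pair twice, so it ranks combinations the same way as the plain sum of pairwise correlations.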
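You could use the combn function:
# assumption: `a` is not defined in the original post; it is taken here to be
# the four sets bound column-wise
a = cbind(setA, setB, setC, setD)
fun = function(x){
  nm = paste0(names(x),collapse="")
  # skip combinations that take two columns from the same set
  if(!grepl("(.)\\d.*\\1",nm,perl = TRUE))
    setNames(sum(cor(x)),nm)
}
unlist(combn(a,4,fun,simplify = FALSE))[1:3] # only the first 3 are printed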
a1b1c1d1 a1b1c1d2 a1b1c1d3
3.246442 4.097532 3.566949
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
[1] 3.246442
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d2)))
[1] 4.097532
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d3)))
[1] 3.566949
Here is a function we can use to get n non-repeating columns from each data frame to get the max total correlation:
func <- function(n, ...){
list.df <- list(...)
n.df <- length(list.df)
# 1) First get the correlations
get.two.df.cors <- function(df1, df2) apply(df1, 2,
function(x) apply(df2, 2, function(y) cor(x,y))
)
cor.combns <- lapply(list.df, function(x)
lapply(list.df, function(y) get.two.df.cors(x,y))
)
# 2) Define a function to help with aggregating the correlations.
# We will call it later for different combinations of selected columns from each df.
# cmbns: a matrix whose i-th row holds the columns selected from the i-th df
# returns the "total correlation" of that selection
get.cmbn.sum <- function(cmbns, cor.combns){
# a helper matrix to help aggregation
# each row represents which two data frames we want to get the correlation sums
df.df <- t(combn(seq(n.df), 2, c))
# convert to list of selections for each df
cmbns <- split(cmbns, seq(nrow(cmbns)))
sums <- apply(df.df, 1,
function(dfs) sum(
cor.combns[[dfs[1]]][[dfs[2]]][cmbns[[dfs[2]]], cmbns[[dfs[1]]]]
)
)
# sum of the sums give the "total correlation"
sum(sums)
}
# 3) Now perform the aggregation
# get the methods of choosing n columns from each of the k data frames
if (n==1) {
cmbns.each.df <- lapply(list.df, function(df) matrix(seq(ncol(df)), ncol=1))
} else {
cmbns.each.df <- lapply(list.df, function(df) t(combn(seq(ncol(df)), n, c)))
}
# get all unique selection methods
unique.selections <- Reduce(function(all.dfs, new.df){
all.dfs.lst <- rep(list(all.dfs), nrow(new.df))
all.new.rows <- lapply(seq(nrow(new.df)), function(x) new.df[x,,drop=F])
for(i in seq(nrow(new.df))){
for(j in seq(length(all.dfs.lst[[i]]))){
all.dfs.lst[[i]][[j]] <- rbind(all.dfs.lst[[i]][[j]], all.new.rows[[i]])
}
}
do.call(c, all.dfs.lst)
}, c(list(list(matrix(numeric(0), nrow=0, ncol=n))), cmbns.each.df))
# for each unique selection method, calculate the total correlation
result <- sapply(unique.selections, get.cmbn.sum, cor.combns=cor.combns)
return( unique.selections[[which.max(result)]] )
}
And now we have:
# n = 1
func(1, setA, setB, setC, setD)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 2
# n = 2
func(2, setA, setB, setC, setD)
# [,1] [,2]
# [1,] 1 2
# [2,] 2 3
# [3,] 2 3
# [4,] 2 3

How to perform the same t-test in a for loop?

I have a database with columns theme (value 0 or 1), level (value 1 to 9) and startTime (double value). For every level, I want to perform a t-test on the startTime values. Here is my code:
database <- read.csv("database.csv")
themeData <- database[database$theme == 1, ]
noThemeData <- database[database$theme == 0, ]
for (i in 1:9) {
x <- themeData[themeData$level == i, ]
y <- noThemeData[noThemeData$level == i, ]
t.test(x$startTime,y$startTime,
alternative = "less")
}
Unfortunately, no t-tests are being executed. In the end, x and y simply get the value for i=9. What am I doing wrong?
Your code is doing busy work: it performs the calculations of t.test, but since for loops always discard their implied results, you aren't storing them anywhere. You need to store them in a vector or list (pre-allocated is always better), like so:
res <- replicate(9, NULL)
for (i in 1:9) {
x <- themeData[themeData$level == i, ]
y <- noThemeData[noThemeData$level == i, ]
res[[i]] <- t.test(x$startTime,y$startTime,
alternative = "less")
}
res[[2]]
This can be "good enough" in that it saves all test result objects in a list for later processing/consumption. A slightly better method is to use one of the *apply functions. The first two I think of that are directly applicable here, lapply and sapply(..., simplify=FALSE), have various minor advantages; frankly, you can choose either. Here is an example using mtcars:
res <- lapply(c(4, 6, 8), function(thiscyl) {
am0 <- subset(mtcars, am == 0 & cyl == thiscyl)
am1 <- subset(mtcars, am == 1 & cyl == thiscyl)
t.test(am0$mpg, am1$mpg)
})
This is especially beneficial if (unlike here) the tests take a long time: you perform the tests once and preserve the results, so you can do lots of things with them without having to rerun the tests. For instance, if you wanted just the p-values:
sapply(res, `[`, "p.value")
# $p.value
# [1] 0.01801712
# $p.value
# [1] 0.187123
# $p.value
# [1] 0.7038727
or more tersely:
sapply(res, `[[`, "p.value")
# [1] 0.01801712 0.18712303 0.70387268
Another example, the confidence intervals, in a matrix:
t(sapply(res, `[[`, "conf.int"))
# [,1] [,2]
# [1,] -9.232108 -1.117892
# [2,] -3.916068 1.032735
# [3,] -2.339549 1.639549
You can always look at a single model with, say, res[[2]], but if you need to see all of them you can use just res and see the whole gamut.
res[[2]]
# Welch Two Sample t-test
# data: am0$mpg and am1$mpg
# t = -1.5606, df = 4.4055, p-value = 0.1871
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -3.916068 1.032735
# sample estimates:
# mean of x mean of y
# 19.12500 20.56667
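Applied to the layout in the question (columns theme, level, startTime), the same pattern might look like the sketch below. The data are simulated stand-ins for database.csv, which is not available here, and the names by_level and p_values are made up for illustration.

```r
# Hypothetical data standing in for database.csv
set.seed(123)
database <- data.frame(
  theme     = rep(c(0, 1), each = 30),
  level     = rep(1:3, times = 20),
  startTime = rnorm(60, mean = 10)
)

# one data frame per level, then one t-test per data frame
by_level <- split(database, database$level)
res <- lapply(by_level, function(d) {
  t.test(d$startTime[d$theme == 1],
         d$startTime[d$theme == 0],
         alternative = "less")
})

# named vector of p-values, one per level
p_values <- vapply(res, function(r) r$p.value, numeric(1))
p_values
```

Splitting on level first means the list is named by level, so res[["2"]] retrieves the full test object for level 2.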

Multiple ttests

I want to perform multiple t-tests on data in the following format:
the first column is "id", with values (for example) 1,1,1,2,2,2
the second column is "ratios", with values 0.2, 0.18, 0.3, 1.5, 1.4, 1.6
For each instance of "id", I want to test all of its ratio values against all the ratio values in the data frame.
Right now I have this:
data <- read.delim("clipboard", stringsAsFactors=FALSE) ##data to test
dist <- as.numeric(readClipboard()) ##distribution to test against
data$Ratio.Mean.H.L <- NA
data$p.value <- NA
for (i in 1:nrow(data))
if (nrow(data) > 1)
{
# Welch t-test
t.test.result <- t.test(data$ratio[i],dist,
alternative = "two.sided",
mu = 0,
paired = FALSE,
var.equal = FALSE,
conf.level = 0.95)
#writes data into the data.frame
data$p.value[i] <- t.test.result$p.value
}
write.table(data, file="C:/R_Temp/t-test.txt", sep = "\t")
I know this does not work; for one, I am not sure I am only testing rows that share the same "id". I am also manually entering the distribution to test against, which is all entries in the "ratio" column.
How do I do this correctly, and how do I add multiple testing correction (Bonferroni)?
I suspect that MattParker's comment is going to be the biggest thing here: you are comparing a single number with a vector, and t.test will complain about that. Since you suggested that you want to perform tests per grouping variable (id), in base R you probably want to use a function like by (or split). (There are great methods within dplyr and data.table as well.)
Using mtcars as sample data, I'll try to mimic your data:
dat <- mtcars[c("cyl", "mpg")]
colnames(dat) <- c("id", "ratio")
It isn't clear what you mean to use for dist, so I'll use the naïve
dist <- 1:10
Now you can do:
by(dat$ratio, dat$id, function(x) t.test(x, dist, paired = FALSE)$p.value)
# dat$id: 4
# [1] 2.660716e-10
# ------------------------------------------------------------
# dat$id: 6
# [1] 4.826322e-09
# ------------------------------------------------------------
# dat$id: 8
# [1] 2.367184e-07
If you want/need to deal with more than just ratio at a time, you can alternatively do this:
by(dat, dat$id, function(x) t.test(x$ratio, dist, paired = FALSE)$p.value)
# dat$id: 4
# [1] 2.660716e-10
# ------------------------------------------------------------
# dat$id: 6
# [1] 4.826322e-09
# ------------------------------------------------------------
# dat$id: 8
# [1] 2.367184e-07
The results from the call to by are a class "by", which is really just a repackaged list with some extra attributes:
res <- by(dat, dat$id, function(x) t.test(x$ratio, dist, paired = FALSE)$p.value)
class(res)
# [1] "by"
str(attributes(res))
# List of 4
# $ dim : int 3
# $ dimnames:List of 1
# ..$ dat$id: chr [1:3] "4" "6" "8"
# $ call : language by.data.frame(data = dat, INDICES = dat$id, FUN = function(x) t.test(x$ratio, dist, paired = FALSE)$p.value)
# $ class : chr "by"
So you can expand/access it however you would a list:
res[[1]]
# [1] 2.660716e-10
as.numeric(res)
# [1] 2.660716e-10 4.826322e-09 2.367184e-07
names(res)
# [1] "4" "6" "8"
(Realize that the different levels of dat$id are the integers 4, 6, and 8, so the names should correspond to your $id.)
Edit:
If you want the results in a data.frame, two options come to mind:
1. Repeat the p-value for each and every row, resulting in a lot of duplication. I discourage this method for several reasons; if you need it at some point, I suggest using option 2 and then merging.
2. Produce a data.frame with as many rows as there are unique ids. Something like:
do.call(rbind.data.frame,
        by(dat, dat$id, function(x) list(id=x$id[1], pv=t.test(x$ratio, dist, paired=FALSE)$p.value)))
#   id           pv
# 4  4 2.660716e-10
# 6  6 4.826322e-09
# 8  8 2.367184e-07
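The question also asks about Bonferroni correction. A minimal sketch, assuming the same mtcars stand-in data and placeholder dist as above, is to collect one p-value per id and pass the vector to p.adjust from the stats package; the names p_raw and p.bonferroni are made up for illustration.

```r
# Mimic the answer's setup with the built-in mtcars data
dat <- mtcars[c("cyl", "mpg")]
colnames(dat) <- c("id", "ratio")
dist <- 1:10   # placeholder reference distribution, as in the answer

# one p-value per id
p_raw <- vapply(split(dat$ratio, dat$id),
                function(x) t.test(x, dist)$p.value,
                numeric(1))

# Bonferroni: multiply by the number of tests, capped at 1
res <- data.frame(id = names(p_raw),
                  p.value = unname(p_raw),
                  p.bonferroni = p.adjust(p_raw, method = "bonferroni"))
res
```

Other corrections, such as method = "BH", can be swapped in without changing the rest of the code.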
OK, sorry for the poorly defined question. I got help elsewhere and will post the script that worked, for those who are interested. I want to calculate p-values for ratio changes in a proteomics experiment. To do this, I run individual t-tests on all the ratio measurements for any given protein or PTM site. These measurements are compared to the median of all measurements (mu in the t.test function), or to the entire distribution of measurements. In one column I have "id"s which are unique for each entry; in the other column I have "values" (ratios). I run t-tests comparing all "values" that occur with any given unique "id". For ease of use I paste the table into the script rather than reading it from a file (it saves me a step).
data <- read.delim("clipboard", stringsAsFactors=FALSE) ## data to test (two columns, "id" and "value"); log-transform ratios first!
summary(data)
med <- median(data$value)
# function for the id-grouped t-test
calc_id_ttest <- function(d) # col1: id, col2: values
{
  colnames(d) <- c("id", "value") # reassign the column names
  # calculate the number of values for each id
  res_N <- as.data.frame(tapply(d$value, d$id, length))
  colnames(res_N) <- "N"
  res_N$id <- row.names(res_N)
  # calculate the median of the values for each id
  res_med <- as.data.frame(tapply(d$value, d$id, median))
  colnames(res_med) <- "med"
  res_med$id <- row.names(res_med)
  # calculate the p-values
  res_pval <- as.data.frame(tapply(d$value, d$id, function(x)
  {
    if(length(x) < 3)
    { # skip ids with fewer than 3 values
      NA
    }
    else
    {
      # one-sample t-test against the overall median (global `med`);
      # to compare against the entire distribution instead, pass d$value as
      # the second sample; use alternative = "less" or "greater" for a
      # one-sided test
      t.test(x, mu=med)$p.value
    }
  }))
  colnames(res_pval) <- "pval" # nominal p-value
  res_pval$id <- row.names(res_pval)
  res_pval$adj.pval <- p.adjust(res_pval$pval, method = "BH") # multiple testing correction; "bonferroni" is also possible
  res <- Reduce(function(x,y)
  {
    merge(x,y, by = "id", all = TRUE)
  },
  list(res_N, res_med, res_pval))
  return (res)
}
data_result <- calc_id_ttest(d = data)
write.table(data_result, file="C:/R_Temp/t-test.txt", quote = FALSE, row.names = FALSE, col.names = TRUE, sep = "\t")

Computing pairwise Hamming distance between all rows of two integer matrices/data frames

I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row to df1 in terms of hamming distance.
I used e1071 package to compute hamming distance. Hamming distance between two vectors x and y can be computed as for example:
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))
and the resulting distance is 25. Yet I need to do this for all rows of df1 and df2. A trivial method takes a double loop nest and looks terribly slow.
Any ideas how to do this more efficiently? In the end I need to append to df2:
a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the 2nd lowest distance;
a column with the second lowest distance.
Thanks.
Fast computation of hamming distance between two integer vectors of equal length
As I said in my comment, we can do:
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
to compute the hamming distance between two integer vectors x and y of equal length. This only uses base R, yet is more efficient than e1071::hamming.distance, because it is vectorized!
For the example x and y in your post, this gives 25. (My other answer shows what to do if we want pairwise hamming distances.)
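A small worked example may help to see what is being counted. The inputs below are chosen for illustration only: the function compares the 32-bit representations position by position and counts the positions that differ, summed over all elements.

```r
hmd0 <- function(x, y) sum(as.logical(xor(intToBits(x), intToBits(y))))

# 1 = 001 and 3 = 011 differ in one bit; 7 = 111 and 0 = 000 differ in three
hmd0(1L, 3L)
hmd0(7L, 0L)

# for vectors, the per-element bit differences are summed
hmd0(c(1L, 7L), c(3L, 0L))
```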
Fast hamming distance between a matrix and a vector
If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.
hmd <- function(x,y) {
rawx <- intToBits(x)
rawy <- intToBits(y)
nx <- length(rawx)
ny <- length(rawy)
if (nx == ny) {
## quick return
return (sum(as.logical(xor(rawx,rawy))))
} else if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
nc <- nx / ny ## number of cycles
return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
}
}
Note that:
hmd performs computation column-wise. It is designed to be CPU cache friendly. In this way, if we want to do some row-wise computation, we should transpose the matrix first;
there is no obvious loop here; instead, we use tapply().
Fast hamming distance computation between two matrices/data frames
This is what you want. The following function foo takes two data frames or matrices, df1 and df2, and computes the distance between each row of df2 and every row of df1. The argument p is an integer giving how many results to retain per row of df2: p = 3 keeps the 3 smallest distances together with their row ids in df1.
foo <- function(df1, df2, p) {
## check p: at most nrow(df1) matches exist for each row of df2
if (p > nrow(df1)) p <- nrow(df1)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
k <- 1:p
for (i in 1:n) {
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
k <- k + p
}
## recode "id" and "d" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
list(id = id, d = d)
}
Note that:
transposition is done at the beginning, for the reasons given above;
a for loop is used here. This is actually efficient, because considerable computation is done in each iteration. It is also more elegant than using the *apply family, since we ask for multiple outputs (row id id and distance d).
Experiment
This part uses small dataset to test/demonstrate our functions.
Some toy data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
Test hmd first (needs transposition):
hmd(t(as.matrix(df1)), df2[1, ]) ## df1 & first row of df2
# [1] 2 4 6 2 4
Test foo:
foo(df1, df2, p = 2)
# $id
# min.1 min.2
# 1 1 4
# 2 2 3
# 3 5 2
# $d
# mindist.1 mindist.2
# 1 2 2
# 2 1 3
# 3 1 3
If you want to append some columns to df2, you know what to do, right?
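For readers who would like that last step spelled out, here is a minimal sketch. The df2 below is a made-up stand-in, and res is built by hand to have the same shape as the list returned by foo (the values are copied from the printed output above), so the block runs on its own.

```r
# Hypothetical stand-ins: df2 and a result shaped like foo()'s return value
df2 <- data.frame(V1 = 1:3, V2 = 4:6)
res <- list(
  id = data.frame(min.1 = c(1, 2, 5), min.2 = c(4, 3, 2)),
  d  = data.frame(mindist.1 = c(2, 1, 1), mindist.2 = c(2, 3, 3))
)

# row i of res$id and res$d belongs to row i of df2, so cbind lines them up
df2_out <- cbind(df2, res$id, res$d)
df2_out
```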
Please don't be surprised that I am adding another section. This part covers something related: it is not what the OP asks for, but it may help other readers.
General hamming distance computation
In the previous answer, I start from a function hmd0 that computes hamming distance between two integer vectors of the same length. This means if we have 2 integer vectors:
set.seed(0)
x <- sample(1:100, 6)
y <- sample(1:100, 6)
we will end up with a scalar:
hmd0(x,y)
# 13
What if we want to compute pairwise hamming distance of two vectors?
In fact, a simple modification to our function hmd will do:
hamming.distance <- function(x, y, pairwise = TRUE) {
nx <- length(x)
ny <- length(y)
rawx <- intToBits(x)
rawy <- intToBits(y)
if (nx == 1 && ny == 1) return(sum(as.logical(xor(intToBits(x),intToBits(y)))))
if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
bits <- length(intToBits(0)) ## 32-bit or 64 bit?
result <- unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nx, each = bits), sum)) ## nx is the longer length after pivoting
}
if (pairwise) result else sum(result)
}
Now
hamming.distance(x, y, pairwise = TRUE)
# [1] 0 3 3 2 5 0
hamming.distance(x, y, pairwise = FALSE)
# [1] 13
Hamming distance matrix
If we want to compute the hamming distance matrix, for example,
set.seed(1)
x <- sample(1:100, 5)
y <- sample(1:100, 7)
The distance matrix between x and y is:
outer(x, y, hamming.distance) ## pairwise argument has no effect here
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 2 3 4 3 4 4 2
# [2,] 7 6 3 4 3 3 3
# [3,] 4 5 4 3 6 4 2
# [4,] 2 3 2 5 6 4 2
# [5,] 4 3 4 3 2 0 2
We can also do:
outer(x, x, hamming.distance)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 5 2 2 4
# [2,] 5 0 3 5 3
# [3,] 2 3 0 2 4
# [4,] 2 5 2 0 4
# [5,] 4 3 4 4 0
In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with using outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, but I will not show it here.
Here's an alternative solution that uses only base R, and should be very fast, especially when your df1 and df2 have many rows. The main reason for this is that it does not use any R-level looping for calculating the Hamming distances, such as for-loops, while-loops, or *apply functions. Instead, it uses matrix multiplication for computing the Hamming distance. In R, this is much faster than any approach using R-level looping. Also note that using an *apply function will not necessarily make your code any faster than using a for loop. Two other efficiency-related features of this approach are: (1) It uses partial sorting for finding the best two matches for each row in df2, and (2) It stores the entire bitwise representation of df1 in one matrix (same for df2), and does so in one single step, without using any R-level loops.
The function that does all the work:
# INPUT:
# X corresponds to your entire df1, but is a matrix
# Y corresponds to your entire df2, but is a matrix
# OUTPUT:
# Matrix with four columns corresponding to the values
# that you specified in your question
fun <- function(X, Y) {
  # Record the dimensions before X and Y are overwritten
  nx <- nrow(X)
  ny <- nrow(Y)
  nc <- ncol(X)
  # Convert integers to bits
  X <- intToBits(t(X))
  # Reshape into matrix: one column of bits per row of df1
  dim(X) <- c(nc * 32, nx)
  # Convert integers to bits
  Y <- intToBits(t(Y))
  # Reshape into matrix: one column of bits per row of df2
  dim(Y) <- c(nc * 32, ny)
  # Calculate pairwise hamming distances using matrix
  # multiplication.
  # Columns of H index into Y; rows index into X.
  # The code for the hamming() function was retrieved
  # from this page:
  # https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/
  H <- hamming(X, Y)
  # Now, for each row in Y, find the two best matches
  # in X. In other words: for each column in H, find
  # the two smallest values and their row indices.
  t(apply(H, 2, function(h) {
    mindists <- sort(h, partial = 1:2)
    c(
      ind1 = which(h == mindists[1])[1],
      val1 = mindists[1],
      ind2 = which(h == mindists[2])[1],
      val2 = mindists[2]
    )
  }))
}
To call the function on some random data:
# Generate some random test data with no. of columns
# corresponding to your data
nrows <- 1000
ncols <- 26
# X corresponds to your df1
X <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
# Y corresponds to your df2
Y <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
res <- fun(X, Y)
The above example with 1000 rows in both X (df1) and Y (df2) took about 1.1 - 1.2 seconds to run on my laptop.
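The hamming() function itself is not reproduced in this answer. To illustrate the matrix-multiplication idea without it, here is a minimal sketch for 0/1 bit matrices; it is not the code from the linked page, and the helper names to_bits and hamming_bits are made up for illustration. It relies on the fact that two bits differ exactly when one is 1 and the other is 0.

```r
# Hypothetical helper: expand an integer matrix into a 0/1 matrix with one
# column per original row (32 bits per integer).
to_bits <- function(M) {
  B <- as.integer(intToBits(t(M)))
  dim(B) <- c(ncol(M) * 32L, nrow(M))
  B
}

# Hypothetical helper: pairwise Hamming distances between columns of two
# 0/1 matrices. A bit differs when it is 1 in A and 0 in B, or the reverse.
hamming_bits <- function(A, B) {
  crossprod(A, 1 - B) + crossprod(1 - A, B)
}

X <- matrix(c(1L, 7L,
              0L, 0L), nrow = 2, byrow = TRUE)
Y <- matrix(c(3L, 0L,
              1L, 7L), nrow = 2, byrow = TRUE)

H <- hamming_bits(to_bits(X), to_bits(Y))  # rows index X, columns index Y
H
```

For example, H[1, 2] is 0 because the first row of X equals the second row of Y, while H[1, 1] is 4 (1 vs 3 differ in one bit, 7 vs 0 in three).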

Resources