Similarity / distance between many pairs of matrices in R

I want to quantify group similarity by computing, for each pair of groups, the mean of the distances between all pairs of (multidimensional) points, one from each group.
I can do this easily enough for each pair of groups manually, like so:
library(dplyr)
library(tibble)
library(proxy)
# dummy data
set.seed(123)
df1 <- data.frame(x = rnorm(100, 0, 4),
                  y = rnorm(100, 1, 5),
                  z = rbinom(100, 1, 0.1))
df2 <- data.frame(x = rnorm(100, -1, 3),
                  y = rnorm(100, 0, 6),
                  z = rbinom(100, 1, 0.1))
df3 <- data.frame(x = rnorm(100, -30, 4),
                  y = rnorm(100, 10, 2),
                  z = rbinom(100, 1, 0.9))
# compute distance (unscaled, uncentred data)
dist(df1, df2, method = "gower") %>% mean
dist(df1, df3, method = "gower") %>% mean
dist(df2, df3, method = "gower") %>% mean
But I'd like to somehow vectorise this as my actual data has 30+ groups. A simple for loop can achieve this like so:
# combine data and scale, centre
df <- rbind(df1, df2, df3) %>%
  mutate(id = rep(1:3, each = 100))
df <- df %>%
  select(-id) %>%
  transmute_all(scale) %>%
  add_column(id = df$id)
# create empty matrix for comparisons
n <- df$id %>% unique %>% length
m <- matrix(nrow = n, ncol = n)
# loop through each pair once
for(i in 1:n) {
  for(j in 1:i) {   # omit top right corner
    if(i == j) {
      m[i,j] <- NA  # omit diagonal
    } else {
      m[i,j] <- dist(df[df$id == i, 1:3], df[df$id == j, 1:3], method = "gower") %>% mean
    }
  }
}
m
[,1] [,2] [,3]
[1,] NA NA NA
[2,] 0.2217443 NA NA
[3,] 0.8446070 0.8233932 NA
However, this method scales predictably badly; a quick benchmark suggests it will take 90+ hours with my actual data, which has 30+ groups with 1000+ rows per group.
Can anyone suggest a more efficient solution, or perhaps a fundamentally different way to frame the problem which I'm missing?

I'm not sure how well this will perform, but here's another approach. You use ls to obtain the names of the data frames, combn to generate pairs of two, and then get to retrieve the objects for calculating dist.
do.call(rbind,
        combn(ls(pattern = "df\\d+"), 2, FUN = function(x)
          data.frame(pair = toString(x),
                     dist = mean(dist(get(x[1]), get(x[2]), method = "gower")),
                     stringsAsFactors = FALSE),
          simplify = FALSE
        ))
# pair dist
#1 df1, df2 0.2139304
#2 df1, df3 0.8315169
#3 df2, df3 0.8320911
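If you need the result as a lower-triangular matrix like m in the question, here is a minimal reshaping sketch (assuming the data.frame above is stored in a variable res, a name I've introduced for illustration):
res_ids <- do.call(rbind, strsplit(res$pair, ", "))   # split "df1, df2" back into its two names
nms <- sort(unique(c(res_ids)))
m <- matrix(NA_real_, length(nms), length(nms), dimnames = list(nms, nms))
m[cbind(res_ids[, 2], res_ids[, 1])] <- res$dist      # fill the lower triangle, matching the loop's layout
m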

You could take each pair of groups, concatenate them, and then just calculate the dissimilarity matrix within that group. Obviously this means you're comparing a group to itself to an extent, but it may still work for your use case, and with daisy it is reasonably quick for your size of data.
library(cluster)
n <- 30
groups <- vector("list", n)
# dummy data
set.seed(123)
for(i in 1:n) {
  groups[[i]] <- data.frame(x = rnorm(1000, ceiling(runif(1, -10, 10)), ceiling(runif(1, 2, 4))),
                            y = rnorm(1000, ceiling(runif(1, -10, 10)), ceiling(runif(1, 2, 4))),
                            z = rbinom(1000, 1, runif(1, 0.1, 0.9)))
}
m <- matrix(nrow = n, ncol = n)
# loop through each pair once
for(i in 1:n) {
  for(j in 1:i) {   # omit top right corner
    if(i == j) {
      m[i,j] <- NA  # omit diagonal
    } else {
      # concatenate groups
      dat <- rbind(groups[[i]], groups[[j]])
      # compute all distances (between groups and within groups), return matrix
      mm <- dat %>%
        daisy(metric = "gower") %>%
        as.matrix
      # retain only distances between groups
      mm <- mm[(nrow(groups[[i]]) + 1):nrow(dat), 1:nrow(groups[[i]])]
      # write mean distance to global comparison matrix
      m[i,j] <- mean(mm)
    }
  }
}

proxy can work with lists of matrices as input; you only need to define a wrapper function that does what you want:
nested_gower <- function(x, y, ...) {
  mean(proxy::dist(x, y, ..., method = "gower"))
}
proxy::pr_DB$set_entry(
  FUN = nested_gower,
  names = c("ngower"),
  distance = TRUE,
  loop = TRUE
)
df_list <- list(df1, df2, df3)
proxy::dist(df_list, df_list, method = "ngower")
[,1] [,2] [,3]
[1,] 0.1978306 0.2139304 0.8315169
[2,] 0.2139304 0.2245903 0.8320911
[3,] 0.8315169 0.8320911 0.2139049
This will still be slow, but it should be faster than for loops in plain R (proxy uses C in the background).
Important: note that the diagonal of the resulting cross-distance matrix doesn't have zeros. If you were to call dist like proxy::dist(df_list, method = "ngower"), proxy would assume that distance(x, y) = distance(y, x) (symmetry) and that distance(x, x) = 0, the latter of which is not true in this case. Passing two arguments to dist prevents this assumption.
If you really don't care about the diagonal, pass only one argument to save some extra time by avoiding the calculation of the upper triangular. Alternatively, if you do care about the diagonal but still want to avoid calculating the upper triangular, call dist first with one argument and then call proxy::dist(df_list, df_list, method = "ngower", pairwise = TRUE).
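For example, a minimal sketch of that combination, using only the calls already defined above:
# lower triangular of the cross-distance matrix (diagonal and upper triangle skipped)
cross_lower <- proxy::dist(df_list, method = "ngower")
# element-wise distances of each group with itself, i.e. the missing diagonal
diag_vals <- proxy::dist(df_list, df_list, method = "ngower", pairwise = TRUE)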
Side note: if you want to imitate this behavior with the gower package (as suggested by d.b), you could define the wrapper function as:
nested_gower <- function(x, y, ...) {
  distmat <- sapply(seq_len(nrow(y)), function(y_row) {
    gower::gower_dist(x, y[y_row, , drop = FALSE], ...)
  })
  mean(distmat)
}
However, the values returned seem to change depending on how many records are passed to the functions, so it's hard to tell what would be the best approach.
*Use proxy::pr_DB$delete_entry("ngower") first if you want to redefine a function in proxy.
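As a minimal sketch of that redefinition, reusing only the registration calls shown earlier:
proxy::pr_DB$delete_entry("ngower")                       # drop the previous definition
proxy::pr_DB$set_entry(FUN = nested_gower, names = c("ngower"), distance = TRUE, loop = TRUE)
proxy::dist(df_list, df_list, method = "ngower")          # now uses the gower-package wrapper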
If you prefer proxy's version of the Gower cross-distance matrix, it occurs to me that you could leverage some of the functionality of my dtwclust package to do the calculations in parallel:
library(dtwclust)
library(doParallel)
custom_dist <- new("tsclustFamily", dist = "ngower", control = list(symmetric = TRUE))@dist
workers <- makeCluster(detectCores())
registerDoParallel(workers)
distmat <- custom_dist(df_list)
stopCluster(workers); registerDoSEQ()
This might be faster for your actual use case (not so much for the small sample data here). The same caveat about the diagonal applies (so use custom_dist(df_list, df_list) or custom_dist(df_list, pairwise = TRUE)). See section 3.2 here and the documentation of tsclustFamily if you'd like more info.

Related

Unnest/Unlist moving window results in R

I have a dataframe that has two columns, x and y (both populated with numbers). I am trying to look at a moving window within the data, and I've done it like this (source):
# Extract just x and y from the original data frame
df <- dat_fin %>% select(x, y)
# Moving window creation
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
This lapply creates a list of tibbles that are each 10 (x, y) pairs. At this point, I am trying to compute a single quantity using each of the sets of 10 pairs; my current (not working) code looks like this:
library(shotGroups)
for (f in 1:length(windfs)) {
  tsceps[f] = getCEP(windfs[f], accuracy = TRUE)
}
When I run this, I get the error:
Error in getCEP.default(windfs, accuracy = TRUE) : xy must be numeric
My goal is that the variable that I've called tsceps should be a 1 x length(windfs) data frame, each value in which comes from the getCEP calculation for each of the windowed subsets.
I've tried various things with unnest and unlist, all of which were unsuccessful.
What am I missing?
Working code:
df <- dat_fin %>% select(x, y)
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
tsceps <- vector(mode = "numeric", length = length(windfs))
library(shotGroups)
for (j in 1:length(windfs)) {
  tsceps[j] <- getCEP(windfs[[j]], type = "CorrNormal", CEPlevel = 0.50, accuracy = TRUE)
}
ults <- unlist(tsceps)
ults_cep <- vector(mode = "numeric", length = length(ults))
for (k in 1:length(ults)) {
  ults_cep[k] <- ults[[k]]
}
To get this working with multiple type arguments to getCEP, just use additional code blocks for each type required.
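For instance, a minimal sketch of wrapping the per-type computation in a helper; the single-value extraction (first element after unlisting) and the extra type name are assumptions on my part, so check ?getCEP for the exact components you need:
cep_for_type <- function(type) {
  sapply(windfs, function(w) {
    res <- getCEP(w, type = type, CEPlevel = 0.50, accuracy = TRUE)
    unlist(res)[1]  # assumed: the first numeric component is the CEP value of interest
  })
}
ceps_corrnormal <- cep_for_type("CorrNormal")
# ceps_rayleigh <- cep_for_type("Rayleigh")   # "Rayleigh" is an assumed type name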

Is there a quick way in R to predict the class outcome of an observation from a nearest neighbours model from RANN?

I am trying to identify the most probable group that an observation belongs to, for several thousand large datasets. It is possible that some of the data is incorrectly classified and I am trying to work out the most likely "true" value. I have tried to use knn3 from the caret package but the predictions take too long to compute. In researching alternatives I have come across the nn2 function from the RANN package, which performs a nearest neighbour search that is significantly faster than k-nearest neighbours.
library(RANN)
library(tidyverse)
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
The result of the nn2 function is two lists, one of indices and one of distances. I want to use the indices table to work out the most likely grouping of each observation; however, it returns the row number of the observation and not its group. I need to replace this with the group it belongs to (in this case, the Species column).
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
I have removed the first column as the first nearest neighbour is always the observation itself.
matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
This code gives me the output I want, but is there a tidier way of creating this table and then calculating the most common response for each row, with speed of calculation being the key concern?
Your scaling can be a real bottleneck when you have more columns (tested on 200 x 22216 gene expression matrix). My version might not seem that impressive with the iris dataset, but on the larger dataset I get 1.3 sec vs. 32.8 sec execution time.
Using tabulate instead of table gives an additional improvement, which is dwarfed, however, by the matrix scaling.
I used a custom scale function here, but using base::scale on a matrix would already be a major improvement.
I also addressed the issue raised by M. Papenberg of "self" not being considered the nearest neighbor by setting those to NA.
invisible(lapply(c("tidyverse", "matrixStats", "RANN", "microbenchmark", "compiler"),
                 require, character.only = TRUE))
enableJIT(3)
# faster column scaling (modified from https://www.r-bloggers.com/author/strictlystat/)
colScale <- function(x, center = TRUE, scale = TRUE, rows = NULL, cols = NULL) {
  if (!is.null(rows) && !is.null(cols)) {
    x <- x[rows, cols, drop = FALSE]
  } else if (!is.null(rows)) {
    x <- x[rows, , drop = FALSE]
  } else if (!is.null(cols)) {
    x <- x[, cols, drop = FALSE]
  }
  cm <- colMeans(x, na.rm = TRUE)
  if (scale) {
    csd <- matrixStats::colSds(x, center = cm, na.rm = TRUE)
  } else {
    csd <- rep(1, length = length(cm))
  }
  if (!center) cm <- rep(0, length = length(cm))
  x <- t((t(x) - cm) / csd)
  return(x)
}
# your posted version (mostly):
oldv <- function() {
  iris.scaled <- iris %>%
    mutate_if(is.numeric, scale)
  iris.nn2 <- nn2(iris.scaled[1:4])
  distance.index <- iris.nn2$nn.idx[,-1]
  target <- iris.scaled$Species
  category_neighbours <- matrix(target[distance.index[,]],
                                nrow = nrow(distance.index), ncol = ncol(distance.index))
  class <- apply(category_neighbours, 1, function(x) {
    x1 <- table(x)
    names(x1)[which.max(x1)]
  })
  cbind(iris, class)
}
## my version:
myv <- function() {
  iris.scaled <- colScale(data.matrix(iris[, 1:(dim(iris)[2] - 1)]))
  iris.nn2 <- nn2(iris.scaled)
  # set self neighbors to NA
  iris.nn2$nn.idx[iris.nn2$nn.idx - seq_len(dim(iris.nn2$nn.idx)[1]) == 0] <- NA
  # match up categories
  category_neighbours <- matrix(iris$Species[iris.nn2$nn.idx[,]],
                                nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
  # turn category_neighbours into numeric for tabulate
  cn <- matrix(as.numeric(factor(category_neighbours, exclude = NULL)),
               nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
  cnl <- levels(factor(category_neighbours, exclude = NULL))
  # tabulate frequencies and match up with factor levels
  class <- apply(cn, 1, function(x) {
    cnl[which.max(tabulate(x, nbins = length(cnl))[!is.na(cnl)])]
  })
  cbind(iris, class)
}
microbenchmark(oldv(), myv(), times=100L)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> oldv() 11.015986 11.679337 12.806252 12.064935 12.745082 33.89201 100 b
#> myv() 2.430544 2.551342 3.020262 2.612714 2.691179 22.41435 100 a

Create a matrix from a list consisting of unequal matrices for individual bootstraps

I tried to create a matrix from a list which consists of N unequal matrices...
The reason for doing this is to create R individual bootstrap samples.
In the example below there are, e.g., 2 companies, one with 10 and one with just 5 observations.
Data:
set.seed(7)
Time <- c(10,5)
xv <- matrix(c(rnorm(10, 5, 2), rnorm(5, 20, 1), rnorm(10, 5, 2), rnorm(5, 20, 1)), ncol = 2)
y <- matrix(c(rnorm(10, 5, 2), rnorm(5, 20, 1)))
z <- matrix(c(rnorm(10, 5, 2), rnorm(5, 20, 1), rnorm(10, 5, 2), rnorm(5, 20, 1)), ncol = 2)
# create data frame of input variables which helps
# to conduct the rowwise bootstrapping
data <- data.frame(y = y, xv = xv, z = z)
rows <- dim(data)[1]
cols <- dim(data)[2]
# create the index to sample from the different panels
cumTime <- c(0, cumsum(Time))
index <- findInterval(seq(1:rows), cumTime, left.open = TRUE)
# draw R individual bootstrap samples
bootList <- replicate(n = 5, list(), simplify = FALSE)
bootList <- lapply(bootList, function(x)
  by(data, INDICES = index, FUN = function(x)
    dplyr::sample_n(tbl = x, size = dim(x)[1], replace = TRUE)))
---------- UNLISTING ---------
Currently, I try to do it (incorrectly) like this:
Example for just 1 entry of the list:
matrix(unlist(bootList[[1]], recursive = T), ncol = cols)
The desired output is just
bootList[[1]]
as a matrix.
Do you have an idea how to do this and, if possible, reasonably efficiently?
The matrices are then processed in unfortunately slow MLE estimations...
I found a solution for you. From what I gather, you have a data frame containing all observations of all companies, which may have different panel lengths. As a result, you would like to have a bootstrap sample for each company of the same size as the original panel length.
You merely have to add a company indicator:
data$company <- c(rep(1, 10), rep(2, 5))  # this could even be a factor
L1 <- split(data, data$company)
L2 <- lapply(L1, FUN = function(s) s[sample(x = 1:nrow(s), size = nrow(s), replace = TRUE), ])
Stop here if you would like to have separate bootstrap samples, e.g. in case you want to estimate separately. Otherwise, combine them:
bootdata <- do.call(rbind, L2)
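Since the question asks for R individual bootstrap samples, here is a minimal sketch of repeating those two steps R times (R = 5 as in the question; dropping the helper company column is an extra step I have added):
R <- 5
bootList <- replicate(R, {
  L2 <- lapply(L1, function(s) s[sample(nrow(s), nrow(s), replace = TRUE), ])
  boot <- do.call(rbind, L2)
  as.matrix(boot[, setdiff(names(boot), "company")])  # one bootstrap sample as a plain matrix
}, simplify = FALSE)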
Best wishes,
Tim

Apply function or Loop in R: Not numerical, returning NA

I am working with a resampling procedure in R (just like a bootstrap). I have a matrix of response/explanatory variables and would like to draw 999 samples of this matrix to calculate, for each statistic I am working with, the mean, sd and confidence interval. So I wrote a function to do the calculation and return a list:
mydata <- data.frame(a=rnorm(20, 1, 1), b = rnorm(20,1,1))
myfun <- function(data, n){
  sample <- data[sample(n, replace = T), ]
  model1 <- lm(sample[,1] ~ sample[,2])
  return(list(model1[[1]][[1]], model1[[1]][[2]]))
}
result <- as.numeric()
result <- replicate(99, myfun(mydata, 10))
Then, I have a matrix as my output in which the rows are the statistics and the columns are the samplings (nrow = 2 and ncol = 99). I need the mean and sd for each row, but when I try to use the apply function or even a loop the following message shows up:
In mean.default(newX[, i], ...) :
argument is not numeric or logical: returning NA
Moreover:
is.numeric(result)
[1] FALSE
I found it strange, because I never had such problem with similar procedures.
Any thoughts?
Use the following:
myfun <- function(dat, n){
  dat1 <- dat[sample(n, replace = T), ]
  model1 <- lm(dat1[,1] ~ dat1[,2])
  return(coef(model1))
}
replicate(99, myfun(mydata, 10))
The reason is that 'result' is a list of 198 elements with dimension attributes. We need to unlist 'result' and restore the dimension attributes:
result1 <- `dim<-`(unlist(result), dim(result))
and then use apply:
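For example, a minimal sketch of that last step (these are the row-wise summaries the question asks for):
apply(result1, 1, mean)  # mean of each statistic across the 99 samples
apply(result1, 1, sd)    # sd of each statistic across the 99 samples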
Just replace list() with c() in your myfun() function:
mydata <- data.frame(a=rnorm(20, 1, 1), b = rnorm(20,1,1))
myfun <- function(data, n){
  sample <- data[sample(n, replace = T), ]
  model1 <- lm(sample[,1] ~ sample[,2])
  return(c(model1[[1]][[1]], model1[[1]][[2]]))
}
result <- as.numeric()
result <- replicate(99, myfun(mydata, 10))
apply(result, FUN=mean, 1)
apply(result, FUN=sd, 1)
This worked for me:
mydata <- data.frame(a=rnorm(20, 1, 1), b = rnorm(20,1,1))
myfun <- function(data, n){
  sample <- data[sample(n, replace = T), ]
  model1 <- lm(sample[,1] ~ sample[,2])
  return(data.frame(v1 = model1[[1]][[1]], v2 = model1[[1]][[2]]))
}
result <- do.call("rbind",(replicate(99, myfun(mydata, 10), simplify = FALSE)))

How to extract the p.value and estimate from cor.test() in a data.frame?

In this example, I have temperature values from 50 different sites, and I would like to correlate Site1 with all 50 sites. But I want to extract only the components "p.value" and "estimate" generated by the function cor.test() into two different columns of a data.frame.
I have made an attempt and it works, but I don't know how!
For that reason I would like to know how I can simplify my code, because the problem is that I have to run a for loop twice to get my results.
Here is my example:
# Temperature data
data <- matrix(rnorm(500, 10:30, sd = 5), nrow = 100, ncol = 50, byrow = TRUE,
               dimnames = list(c(paste("Year", 1:100)),
                               c(paste("Site", 1:50))))
# Empty data.frame
df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value="")
# Extraction
for (i in 1:50) {
  df1 <- cor.test(data[,1], data[,i])
  df[,2:3] <- df1[c("estimate", "p.value")]
}
for (i in 1:50) {
  df1 <- cor.test(data[,1], data[,i])
  df[i,2:3] <- df1[c("estimate", "p.value")]
}
df
I will appreciate very much your help :)
I might offer up the following as well (masking the loops):
result <- do.call(rbind, lapply(2:50, function(x) {
  cor.result <- cor.test(data[,1], data[,x])
  pvalue <- cor.result$p.value
  estimate <- cor.result$estimate
  return(data.frame(pvalue = pvalue, estimate = estimate))
}))
First of all, I'm guessing you had a typo in your code: you should have rnorm(5000) if you want unique values; otherwise you're going to recycle those 500 numbers 10 times.
Anyway, a simple way of doing this would be:
data <- matrix(rnorm(5000, 10:30, sd = 5), nrow = 100, ncol = 50, byrow = TRUE,
               dimnames = list(c(paste("Year", 1:100)),
                               c(paste("Site", 1:50))))
# Empty data.frame
df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value="")
estimates = numeric(50)
pvalues = numeric(50)
for (i in 1:50){
  test <- cor.test(data[,1], data[,i])
  estimates[i] <- test$estimate
  pvalues[i] <- test$p.value
}
df$Estimate <- estimates
df$P.value <- pvalues
df
Edit: I believe your issue is that in the line df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value=""), if you do typeof(df$Estimate), you see it's expecting an integer, while typeof(test$estimate) shows it spits out a double, so R doesn't know what you're trying to do with those two values. You can redo your code like this:
df <- data.frame(label = paste("Site", 1:50), Estimate = numeric(50), P.value = numeric(50))
for (i in 1:50){
  test <- cor.test(data[,1], data[,i])
  df$Estimate[i] <- test$estimate
  df$P.value[i] <- test$p.value
}
to make it a little more concise.
Similar to the answer of colemand77:
Create a cor function:
cor_fun <- function(x, y, method){
  tmp <- cor.test(x, y, method = method)
  cbind(r = tmp$estimate, p = tmp$p.value)
}
Apply it over the columns of the data. You can transpose the result to get p and r by row:
t(apply(data, 2, cor_fun, data[, 1], "spearman"))
