Calculating a distance matrix by dtw - r

I have two matrices of normalized read counts for control and treatment in a time series, day 1 to day 26. I want to calculate a distance matrix with Dynamic Time Warping and afterwards use it for clustering, but it seems too complicated. This is what I did; could someone help clarify? Thanks a lot.
> head(control[,1:4])
MAST2 WWC2 PHYHIPL R3HDM2
Control_D1 6.591024 5.695156 3.388652 5.756384
Control_D2 8.043454 5.365221 6.859768 6.936970
Control_D3 7.731590 4.868267 6.919972 6.931073
Control_D4 8.129948 5.105528 6.627016 7.090268
Control_D5 7.690863 4.729501 6.824746 6.904610
Control_D6 8.101723 5.334501 6.868990 7.115883
>
> head(lead[,1:4])
MAST2 WWC2 PHYHIPL R3HDM2
Lead30_D1 6.418423 5.610699 3.734425 5.778046
Lead30_D2 7.918360 4.295191 6.559294 6.780952
Lead30_D3 7.807142 4.294722 6.599187 6.716040
Lead30_D4 7.856720 4.432136 6.572337 6.848483
Lead30_D5 7.827311 4.204738 6.607107 6.784094
Lead30_D6 7.848760 4.458451 6.581216 6.943003
>
> dim(control)
[1] 26 2603
> dim(lead)
[1] 26 2603
library(dtw)
for (i in control) {
for (j in lead) {
result[i,j] <- dtw( dist(control[,,i],lead[,,j]), distance.only=T )$normalizedDistance
}
}
This fails with:
Error in lead[, , j] : incorrect number of dimensions

There have already been questions similar to yours,
but the answers haven't been too detailed.
Here's a breakdown of what you need to know,
in the specific case of R.
Calculating cross-distance matrices
The proxy package is made specifically for the calculation of cross-distance matrices.
You should check its vignette to know which measures are already implemented by it.
An example of its use:
set.seed(1L)
sample_data <- matrix(rnorm(50L), nrow = 5L, ncol = 10L)
suppressPackageStartupMessages(library(proxy))
distance_matrix <- proxy::dist(sample_data, method = "euclidean",
upper = TRUE, diag = TRUE)
print(distance_matrix)
#> 1 2 3 4 5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000
Note: in the context of time series,
proxy treats each row in a matrix as a series,
which can be confirmed by the fact that sample_data above is a 5x10 matrix and the resulting cross-distance matrix is 5x5.
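Incidentally, proxy::dist also accepts two matrices, in which case it returns the cross-distances between the rows of the first and the rows of the second. A minimal sketch (other_data is just made-up data for illustration):
other_data <- matrix(rnorm(30L), nrow = 3L, ncol = 10L)
cross_distmat <- proxy::dist(sample_data, other_data, method = "euclidean")
dim(cross_distmat)
#> [1] 5 3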
Using the DTW distance
The dtw package implements many variations of DTW,
and it also leverages proxy.
You could calculate a DTW distance matrix with:
suppressPackageStartupMessages(library(dtw))
dtw_distmat <- proxy::dist(sample_data, method = "dtw",
upper = TRUE, diag = TRUE)
print(dtw_distmat)
Using custom distances
One nice thing about proxy is that it gives you the option to register custom functions.
You seem to be interested in the normalized version of DTW,
so you could do something like this:
ndtw <- function(x, y = NULL, ...) {
dtw::dtw(x, y, ..., distance.only = TRUE)$normalizedDistance
}
pr_DB$set_entry(
FUN = ndtw,
names = "ndtw",
loop = TRUE,
distance = TRUE
)
ndtw_distmat <- proxy::dist(sample_data, method = "ndtw",
upper = TRUE, diag = TRUE)
print(ndtw_distmat)
#> 1 2 3 4 5
#> 1 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> 2 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> 3 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> 4 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> 5 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
See the documentation of pr_DB for more information.
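Note that the registry is global for your R session; if you later want to unregister the custom entry, pr_DB also has a method for that:
pr_DB$delete_entry("ndtw")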
Other DTW implementations
The dtwclust package
(which I made)
implements a basic but faster version of DTW which can use multi-threading and also leverages proxy:
suppressPackageStartupMessages(library(dtwclust))
dtw_basic_distmat <- proxy::dist(sample_data, method = "dtw_basic", normalize = TRUE)
print(dtw_basic_distmat)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> [2,] 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> [3,] 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> [4,] 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> [5,] 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
The dtw_basic implementation only supports two step patterns and one window type,
but it is considerably faster:
suppressPackageStartupMessages(library(microbenchmark))
microbenchmark(
proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L),
proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)
)
Unit: microseconds
                                                                                    expr      min       lq     mean   median       uq       max neval cld
 proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L) 5279.124 5621.742 6070.069 5802.354 6348.199 10411.000   100   b
                       proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)  657.966  710.418  776.474  752.282  814.037  1161.626   100  a
Another multi-threaded implementation is included in the parallelDist package,
although I haven't personally tested it.
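Going by its documentation (I haven't run this either, so consider it a sketch), the equivalent call should look something like:
suppressPackageStartupMessages(library(parallelDist))
# "dtw" is among the methods listed in ?parDist
par_distmat <- parDist(sample_data, method = "dtw")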
Multivariate or multi-dimensional time series
A single multivariate series is commonly a matrix where time spans the rows and the multiple variables span the columns.
DTW also works for them:
mv_series1 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
mv_series2 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
print(dtw_distance <- dtw_basic(mv_series1, mv_series2))
#> [1] 22.80421
The nice thing about proxy is that it can calculate distances between objects contained in lists too,
so you can put several multivariate series in lists of matrices:
mv_series <- lapply(1L:5L, function(dummy) {
matrix(rnorm(15L), nrow = 5L, ncol = 3L)
})
mv_distmat_dtwclust <- proxy::dist(mv_series, method = "dtw_basic")
print(mv_distmat_dtwclust)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00000 27.43599 32.14207 36.42211 31.19279
#> [2,] 27.43599 0.00000 20.88470 23.88436 29.73219
#> [3,] 32.14207 20.88470 0.00000 22.14376 29.99899
#> [4,] 36.42211 23.88436 22.14376 0.00000 28.81111
#> [5,] 31.19279 29.73219 29.99899 28.81111 0.00000
Your case
Regardless of what you choose,
you can probably use proxy to get your result,
but since you haven't provided your whole data,
I can't give you a more specific example.
I presume that dtwclust::dtw_basic(control[, 1:4], lead[, 1:4], normalize = TRUE) would give you the distance between one pair of series,
assuming you're treating each one as a multivariate series with 4 variables.
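Since your end goal is clustering: whichever method you pick, the resulting matrix can be fed straight into base R's hierarchical clustering. A minimal sketch using the toy data from above:
toy_distmat <- proxy::dist(sample_data, method = "dtw_basic", normalize = TRUE)
hc <- hclust(as.dist(toy_distmat), method = "average")
plot(hc)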

If your question is "why am I getting this error?", the answer is that you're trying to subset a matrix, which is a two-dimensional array, along a 3rd dimension.
See:
dim(lead)
# [1] 26 2603
lead[,,6.418423] # yes, that's the value j has the first time through the loop
# This will reproduce your error
lead[,,1]
# This will also reproduce your error
Hopefully you can see now that you have a few problems:
You're trying to subset a matrix along a 3rd dimension.
Your i and j are the values contained in control and lead respectively, not indices. You can use them as values, or you can generate indices instead, e.g., for (i in seq_along(control)), if you plan to use them for anything other than retrieving those same values.
Taking it to the next step, it's unclear what you want to pass to the dist function. dist takes a single matrix and computes the distances between its rows. You seem to be trying to pass it two values from two different matrices, or perhaps two subsets of two different matrices. It looks like you might need to go back and look at the examples in the documentation for dtw.
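To make that concrete, here is a hedged sketch of what the loop was presumably meant to do, assuming each column of control and lead is one gene's series over the 26 days (note this computes 2603 x 2603 DTW alignments, so the proxy-based approach above will be far more practical):
library(dtw)
result <- matrix(0, ncol(control), ncol(lead))
for (i in seq_len(ncol(control))) {
  for (j in seq_len(ncol(lead))) {
    result[i, j] <- dtw(control[, i], lead[, j],
                        distance.only = TRUE)$normalizedDistance
  }
}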

Related

igraph: summarize each node's neighbours characteristics

With an igraph object I would like to capture some features of each node's neighbours, for example the average degree of its neighbours.
I come up with this code, which is inelegant and quite slow.
How should I rethink it for large and complex networks?
library(igraph)
# Toy example
set.seed(123)
g <- erdos.renyi.game(10,0.2)
# Loop to calculate average degree of each node's neighbourhood
s <- character(0)
for(i in 1:gorder(g)){
n <- ego_size(g, nodes = i, order = 1, mindist = 1)
node_of_interest <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
m <- mean(degree(g, v = node_of_interest, loops = TRUE, normalized = FALSE)-1)
s <- rbind(s,data.frame(node = i, neighbours = n, mean = m))
}
Expanding the data structure with rbind in a loop can get quite slow in R, because at every step it needs to allocate the space for the new object and then copy it (see section 24.6 here). Also, you might be computing the degree of a node many times, if it is the neighbour of multiple nodes.
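To illustrate the general point with a generic sketch (not specific to igraph): pre-allocate a list, fill it inside the loop, and bind once at the end.
res <- vector("list", 10L)
for (i in seq_len(10L)) {
  res[[i]] <- data.frame(node = i, value = i^2)  # stand-in computation
}
out <- do.call(rbind, res)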
A possibly better alternative could be:
# add vertex id (not really necessary)
V(g)$name <- V(g)
# add degree to the graph
V(g)$degree <- degree(g, loops = TRUE, normalized = FALSE)
# get a list of neighbours, for each node
g_ngh <- neighborhood(g, mindist = 1)
# write a function that gets the means
get.mean <- function(x){
mean(V(g)$degree[x]-1)
}
# apply the function, add result to the graph
V(g)$av_degr_nei <- sapply(g_ngh, get.mean)
# get data into dataframe, if necessary
d_vert_attr <- as_data_frame(g, what = "vertices")
d_vert_attr
name degree av_degr_nei
1 1 0 NaN
2 2 1 2.0000000
3 3 2 1.0000000
4 4 1 1.0000000
5 5 2 1.0000000
6 6 1 1.0000000
7 7 3 0.6666667
8 8 1 0.0000000
9 9 1 0.0000000
10 10 0 NaN
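As an aside, igraph also ships a built-in for this (if I recall its interface correctly): knn() returns the average degree of each vertex's neighbours, though on the raw degrees, without the -1 adjustment used above:
V(g)$av_degr_nei_builtin <- knn(g)$knn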

R Loop random normal distribution

I'm trying to achieve the following: I want to generate 7 values from a normal distribution. Then I want to take these values, and using them as a mean generate 3 more (for each initial value) values from a normal distribution to replace them. I'd like to write this in a loop.
Let's use sd = 1.5 and sd = 0.7, and start with a mean of 0.
set.seed(1234)
mu.mat<-rnorm(7,mean=0,sd=1.5)
Gives me 7 nice values.
Then I want to create a numeric vector num[1:21] that holds 3 normally distributed values generated with mean = the first value of the just-created vector and sd = 0.7, three more using the second value, and so on.
Of the form:
rnorm(3,mean=mu.mat[1],sd=0.7)
Just for all entries in a loop.
What I've tried:
mu.mat2<-NULL
for(i in 1:7) {
mu.mat2[i]<-rnorm(3,mean=mu.mat[i],sd=0.7)
}
This results in the error: number of items to replace is not a multiple of replacement length.
Any help on how to put this into a loop is very appreciated. Thanks in advance!
You don't need a loop. You can do:
rnorm(21, mean = rep(mu.mat, each = 3), sd = 0.7)
#> [1] -0.4811184 -1.2327778 -1.8603816 -3.3073277 -2.5190560 -3.2298056
#> [7] -2.3695570 -2.0228732 -1.1692489 2.0342910 1.0186855 1.0838678
#> [13] 0.5486730 -0.2439510 -0.1831147 2.2026024 0.1925301 -0.2153864
#> [19] 2.8944894 1.9213206 1.3804706
But the problem with your code is that you are trying to write three values (rnorm(3, mean = mu.mat[i], sd = 0.7)) into the single element mu.mat2[i]. It's not clear whether you were expecting a matrix as a result, but if so your loop would be:
mu.mat2 <- matrix(ncol = 3, nrow = 7)
for(i in 1:7) {
mu.mat2[i,] <- rnorm(3, mean = mu.mat[i], sd = 0.7)
}
If you want that 7 x 3 matrix without a loop, you can do:
matrix(rnorm(21, mean = rep(mu.mat, each = 3), sd = 0.7), ncol = 3, byrow = TRUE)
#> [,1] [,2] [,3]
#> [1,] -0.96624036 -1.4808460 -2.6824842
#> [2,] -2.88942108 -1.7299094 -3.0446737
#> [3,] -2.82034688 -0.9570087 -2.1822797
#> [4,] 0.58997289 1.0384926 1.8111506
#> [5,] -0.07705959 -0.1024418 0.7249310
#> [6,] 0.48851487 1.4729882 0.6496858
#> [7,] 1.47961292 1.5653253 2.0629409
Try replicate like below
> replicate(3,rnorm(length(mu.mat),mu.mat,0.7))
[,1] [,2] [,3]
[1,] -2.19324092 -1.13895278 -2.1540788
[2,] 0.02102746 0.33894402 0.1077604
[3,] 1.00363528 1.26895511 1.9483744
[4,] -3.85258144 -4.15638335 -4.0041507
[5,] -0.05518348 0.05766686 -0.3700564
[6,] 0.21570611 2.45016846 1.1614128
[7,] -0.81698877 -0.76824819 -1.5786689

using cor.test function in R

If x is an n*m matrix, then cor(x) gives me an m*m correlation matrix between each pair of columns.
How can I use the cor.test function on the n*m matrix to also get an m*m matrix of p-values?
There may be an existing function, but here's my version. p_cor_mat runs cor.test on each pair of columns in matrix x and records the p-value. These are then put into a square matrix and returned.
# Set seed
set.seed(42)
# Matrix of data
x <- matrix(runif(120), ncol = 4)
# Function for creating p value matrix
p_cor_mat <- function(x){
# All combinations of columns
colcom <- t(combn(1:ncol(x), 2))
# Calculate p values
p_vals <- apply(colcom, MAR = 1, function(i)cor.test(x[,i[1]], x[,i[2]])$p.value)
# Create matrix for result
p_mat <- diag(ncol(x))
# Fill upper & lower triangles
p_mat[colcom] <- p_mat[colcom[,2:1]] <- p_vals
# Return result
p_mat
}
# Test function
p_cor_mat(x)
#> [,1] [,2] [,3] [,4]
#> [1,] 1.0000000 0.4495713 0.9071164 0.8462530
#> [2,] 0.4495713 1.0000000 0.5960786 0.7093539
#> [3,] 0.9071164 0.5960786 1.0000000 0.7466226
#> [4,] 0.8462530 0.7093539 0.7466226 1.0000000
Created on 2019-03-06 by the reprex package (v0.2.1)
Please also see the cor.mtest() function in the corrplot package.
https://www.rdocumentation.org/packages/corrplot/versions/0.92/topics/cor.mtest
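If I read its documentation correctly, cor.mtest() returns the p-values as the p element of its result, so something along these lines should reproduce the matrix above (a sketch, not run here):
library(corrplot)
res <- cor.mtest(x)
res$p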

Euclidean distance between data sets in R using rdist from "fields" package

I am using rdist to compute Euclidean distances between a matrix and itself:
> m = matrix(c(1,1,1,2,2,2,3,4,3),nrow=3, ncol=3)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
[3,] 1 2 3
library(fields)
> rdist(m)
[,1] [,2] [,3]
[1,] 1e-10 1e+00 1e-10
[2,] 1e+00 1e-10 1e+00
[3,] 1e-10 1e+00 1e-10
What confuses me is that I think it should have 0s on the diagonal (surely the distance of a vector to itself is 0?), and for the same reason it should have 0s where it compares the first and third rows. The value that I see instead (1e-10) looks way too big to be numerical noise. What's going wrong?
EDIT: rdist is from the package fields.
First of all, 1e-10 is simply 1*10^-10, which is 0.0000000001, so numerically very close to 0 (it is the result of a square root, so the actual error in the computation is on the order of 1e-20). Is it "too big"? Well, the library is written in Fortran and is focused on speed, so it is quite acceptable. If you analyze the exact code, you will find out how it is computed:
# fields, Tools for spatial data
# Copyright 2004-2011, Institute for Mathematics Applied Geosciences
# University Corporation for Atmospheric Research
# Licensed under the GPL -- www.gpl.org/licenses/gpl.html
"rdist" <- function(x1, x2) {
if (!is.matrix(x1))
x1 <- as.matrix(x1)
if (missing(x2))
x2 <- x1
if (!is.matrix(x2))
x2 <- as.matrix(x2)
d <- ncol(x1)
n1 <- nrow(x1)
n2 <- nrow(x2)
par <- c(1/2, 0)
temp <- .Fortran("radbas", nd = as.integer(d), x1 = as.double(x1),
n1 = as.integer(n1), x2 = as.double(x2), n2 = as.integer(n2),
par = as.double(par), k = as.double(rep(0, n1 * n2)))$k
return(matrix(temp, ncol = n2, nrow = n1))
}
And the exact answer is hidden in the Fortran files (in radfun.f, called from radbas.f), where you can find the line
if( dtemp.lt.1e-20) dtemp =1e-20
which treats small (even 0) values as 1e-20, and taking the square root of that results in 1e-10. It seems the motivation was to speed up the process by working with the logarithm of the value (so that square rooting becomes a simple division by 2), which of course is not defined for 0.
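If the 1e-10 floor gets in your way downstream (say, when testing for exact zeros), a pragmatic workaround is to clamp it yourself after the call:
d <- rdist(m)
d[d <= 1e-10] <- 0  # treat the implementation's floor value as a true zero
d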

Why does the calculation of Cohen's kappa fail across different packages on this contingency table?

I have a contingency table for which I would like to calculate Cohen's kappa, the level of agreement. I have tried using three different packages, which all seem to fail to some degree. The package e1071 has a function specifically for a contingency table, but that too seems to fail. Below is reproducible code. You will need to install the packages concord, e1071, and irr.
# Recreate my contingency table, output with dput
conf.mat<-structure(c(810531L, 289024L, 164757L, 114316L), .Dim = c(2L,
2L), .Dimnames = structure(list(landsat_2000_bin = c("0", "1"
), MOD12_2000_binForest = c("0", "1")), .Names = c("landsat_2000_bin",
"MOD12_2000_binForest")), class = "table")
library(concord)
cohen.kappa(conf.mat)
library(e1071)
classAgreement(conf.mat, match.names=TRUE)
library(irr)
kappa2(conf.mat)
The output I get from running this is:
> cohen.kappa(conf.mat)
Kappa test for nominally classified data
4 categories - 2 methods
kappa (Cohen) = 0 , Z = NaN , p = NaN
kappa (Siegel) = -0.333333 , Z = -0.816497 , p = 0.792892
kappa (2*PA-1) = -1
> classAgreement(conf.mat, match.names=TRUE)
$diag
[1] 0.6708459
$kappa
[1] NA
$rand
[1] 0.5583764
$crand
[1] 0.0594124
Warning message:
In ni[lev] * nj[lev] : NAs produced by integer overflow
> kappa2(conf.mat)
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 2
Raters = 2
Kappa = 0
z = NaN
p-value = NaN
Could anyone advise on why these might fail? I have a large dataset, but as this table is simple I didn't think that could cause such problems.
In the first function, cohen.kappa, you need to specify that you are using count data and not just an n*m matrix of n subjects and m raters.
cohen.kappa(conf.mat,'count')
The second function is much more tricky. For some reason, your matrix is full of integers and not numerics. R's integers are 32-bit and can't store really big numbers, so when you multiply two of your big counts together, the result overflows. For example:
i=975288
j=1099555
class(i)
# [1] "numeric"
i*j
# 1.072383e+12
as.integer(i)*as.integer(j)
# [1] NA
# Warning message:
# In as.integer(i) * as.integer(j) : NAs produced by integer overflow
So you need to convert your matrix to numeric:
# classAgreement(conf.mat)
classAgreement(matrix(as.numeric(conf.mat),nrow=2))
Finally, take a look at the documentation for ?kappa2. It requires an n*m matrix of n subjects and m raters, as explained above. It just won't work with your (efficient) contingency-table data structure.
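If you really wanted to use kappa2, you could expand the contingency table back into one row per subject, though with roughly 1.4 million subjects that is rather wasteful. A sketch, assuming the table structure shown above:
long <- as.data.frame(conf.mat)
ratings <- long[rep(seq_len(nrow(long)), long$Freq), 1:2]
kappa2(ratings)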
Do you need to know specifically why those fail? Here is a function that computes the statistic for a 2x2 table (written in a hurry, so I might clean it up later; see the Wikipedia article on Cohen's kappa):
kap <- function(x) {
  # observed agreement: proportion of cases on the diagonal
  a <- (x[1,1] + x[2,2]) / sum(x)
  # expected agreement by chance, from the row and column marginals
  e <- (sum(x[1,]) / sum(x)) * (sum(x[,1]) / sum(x)) +
    (1 - (sum(x[1,]) / sum(x))) * (1 - (sum(x[,1]) / sum(x)))
  # kappa: how far observed agreement exceeds chance agreement
  (a - e) / (1 - e)
}
Tests/output:
> (x = matrix(c(20,5,10,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 20 5
[2,] 10 15
> kap(x)
[1] 0.4
> (x = matrix(c(45,15,25,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 45 15
[2,] 25 15
> kap(x)
[1] 0.1304348
> (x = matrix(c(25,35,5,35), nrow=2, byrow=T))
[,1] [,2]
[1,] 25 35
[2,] 5 35
> kap(x)
[1] 0.2592593
> kap(conf.mat)
[1] 0.1258621
