Error in R: Cosine similarity and MDS

I calculate the cosine similarity with cosine() from the package 'lsa'. Here with three test vectors:
d <- data.frame(c(-1,1,0,-1,1,1,-1,1,0),c(-1,1,1,1,-1,1,-1,0,1),c(0,0,1,0,-1,-1,0,1,-1))
colnames(d) <- c("vector1","vector2","vector3")
d_dist <- cosine(as.matrix(d))
Now, I want to do dimensionality reduction with cmdscale and after that plot it as a scatterplot:
fit <- cmdscale(d_dist,k=2)
x <- fit[,2]
y <- fit[,1]
plot(x,y)
But I always get the warning "In cmdscale(d_dist, k = 2): only 0 of the first 2 eigenvalues are > 0" [translated from German] and an empty fit object.
What am I doing wrong? Thank you so much for your help!

cmdscale() expects a distance (dissimilarity) matrix, but cosine() returns similarities. Convert the similarities to dissimilarities first, e.g.:
d_dist <- 1-d_dist
fit <- cmdscale(d_dist,k=2)
x <- fit[,2]
y <- fit[,1]
plot(x,y)
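If you want to check that the conversion worked, you can also ask cmdscale() for the eigenvalues and label the points (a quick sketch built on the objects above):
fit <- cmdscale(d_dist, k = 2, eig = TRUE)   # eig = TRUE also returns the eigenvalues
fit$eig                                      # the first two should now be > 0
x <- fit$points[, 2]
y <- fit$points[, 1]
plot(x, y)
text(x, y, labels = rownames(fit$points), pos = 3)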

Related

Manually calculate two-sample kolmogorov-smirnov using ECDF

I am trying to manually compute the KS statistic for two random samples. As far as I understand, the KS statistic D is the maximum vertical deviation between the two ECDFs. However, manually calculating the difference between the two ECDFs and running ks.test from base R yield different results. I wonder where the mistake is.
set.seed(123)
a <- rnorm(10000)
b <- rnorm(10000)
### Manual calculation
# function for calculating manually the ecdf
decdf <- function(x, baseline, treatment) ecdf(baseline)(x) - ecdf(treatment)(x)
#Difference between the two CDFs
d <- curve(decdf(x,a,b), from=min(a,b), to=max(a,b))
# getting D
ks <- max(abs(d$y))
#### R-Base calculation
ks.test(a,b)
The base-R D = 0.0109, while the manual calculation gives 0.0088. Any help explaining the difference is appreciated.
I attach the base-R source code (cleaned up a bit):
n <- length(a)
n.x <- as.double(n)
n.y <- length(b)
n <- n.x * n.y/(n.x + n.y)
w <- c(a, b)
z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
STATISTIC <- max(abs(z))
By default, curve evaluates the function on a subdivision of 100 points between from and to. By restricting the comparison to these 100 points, it is possible to miss the point at which the maximum difference is attained.
Instead, evaluate the difference at all points where the ECDFs jump; then you are sure to catch the point at which the maximum difference is attained.
set.seed(123)
a <- rnorm(10000)
b <- rnorm(10000)
Fa <- ecdf(a)
Fb <- ecdf(b)
x <- c(a,b) # the points where Fa or Fb jump
max(abs(Fa(x) - Fb(x)))
# [1] 0.0109
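As a cross-check, this matches the statistic reported by ks.test():
ks.test(a, b)$statistic
#      D
# 0.0109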

How to draw dendrogram using ape package in R?

I have a distance matrix of size ~200 x 200, and I am unable to plot a dendrogram using the BioNJ option of the ape library in R.
The matrix is too big for the plot to be legible.
In what ways can I improve the visibility?
Two options, depending on your data.
If you need to calculate the distance matrix from your data, then use:
set.seed(1) # makes random sampling with rnorm reproducible
# example matrix
m <- matrix(rnorm(100), nrow = 5) # any MxN matrix
distm <- dist(m) # distance matrix
hm <- hclust(distm)
plot(hm)
If your data is a distance matrix (must be a square matrix!)
set.seed(1)
# example matrix
m <- matrix(rnorm(25), nrow=5) # must be square matrix!
distm <- as.dist(m)
hm <- hclust(distm)
plot(hm)
A 200 x 200 distance matrix gives me a reasonable plot
set.seed(1)
# example matrix
m <- matrix(rnorm(200*200), nrow=200) # must be square matrix!
distm <- as.dist(m)
hm <- hclust(distm)
plot(hm)
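If you specifically want the BioNJ tree from ape rather than an hclust dendrogram, something along these lines should work; bionj() accepts a distance matrix, and a fan layout with small labels (or a large output device) keeps ~200 tips readable. A sketch with toy data:
library(ape)
set.seed(1)
# toy data: 200 observations -> a 200 x 200 Euclidean distance matrix
distm <- dist(matrix(rnorm(200*10), nrow = 200))
tree <- bionj(distm)                               # BioNJ tree from the distances
plot(tree, type = "fan", cex = 0.4, no.margin = TRUE)
# or send it to a large device so all labels stay legible
pdf("bionj_tree.pdf", width = 15, height = 15)
plot(tree, cex = 0.4)
dev.off()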

MIC correlation between 2 matrices in R

The minerva package provides a function to compute the Maximal Information Coefficient (MIC). The package description stipulates that the function mine(x, y) only works with two matrices A and B of the same size.
Here, I would like to obtain the MIC coefficient values from the correlation of two matrices A and B of different sizes; respectively, A is n by m and B is n by z, with n being the number of observations (rows).
In other words, my aim is to obtain an m by z matrix C which, for each entry, gives the MIC correlation coefficient (and, if possible, the associated p-value, if any).
I provide an example of what I want with the Pearson correlation.
set.seed(1)
x <- matrix(rnorm(5*10), nrow=5, ncol=10)
y <- matrix(rnorm(5*20), nrow=5, ncol=20)
P <- cor(x, y=y)
I emailed one of the authors of the minerva package without success. Is there any way I can apply the mine function to obtain the desired m by z correlation matrix?
Let me answer my own post. In the code below I use a double loop, which may not be the smartest/fastest way to do it, but it works as expected.
library(minerva)
set.seed(1)
x <- matrix(rnorm(5*10), nrow=5, ncol=10)
y <- matrix(rnorm(5*20), nrow=5, ncol=20)
Result <- matrix(ncol = ncol(y), nrow = ncol(x))
for (i in 1:ncol(x)) {
  Thisvar <- x[, i]
  print(i)
  for (k in 1:ncol(y)) {
    Thisvar2 <- y[, k]
    res <- mine(Thisvar, Thisvar2, master = TRUE, use = "all.obs")
    Result[i, k] <- res$MIC
  }
}
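The same m by z matrix can also be filled without the explicit double loop, for example with nested sapply calls (a sketch reusing the x and y defined above; mine() is called pairwise exactly as in the loop):
Result2 <- sapply(seq_len(ncol(y)), function(k)
  sapply(seq_len(ncol(x)), function(i)
    mine(x[, i], y[, k], use = "all.obs")$MIC))
dim(Result2)  # ncol(x) rows by ncol(y) columns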

Calculating divergence between joint posterior distributions

I wish to calculate the distance between two 3-dimensional posterior distributions. The draws are stored in two 30,000 x 3 matrices.
So far I have been successful in calculating Total Variation distance between two 2-dimensional posteriors (two 30,000x2 matrices) by splitting the grid into bins. However, I am having trouble calculating the divergence between posteriors with more parameters. Some examples of related distance measures can be found here.
NOTE: I do not wish to calculate the distance between the marginals (column-wise entries); rather, I want to obtain an overall value from comparing the joint distributions in R.
I would really appreciate it if somebody could point out what I am missing here.
EDIT 1: Some example code for calculating Total variation distance between posterior samples stored in two matrices has been added below:
EDIT 2: This is an R question.
set.seed(123)
comparison.2D <- matrix(rnorm(40000*2,0,1),ncol=2)
ground.truth.2D <- matrix(rnorm(40000*2,0,2),ncol=2)
# Function to calculate TVD between matrices with 2 columns:
Total.Variation.Distance.2D <- function(true, comparison, burnin, window.size) {
  # Bandwidths for theta.1 and theta.2:
  my_bw_x <- window.size
  my_bw_y <- window.size
  range_x <- range(c(true[-c(1:burnin), 1], comparison[-c(1:burnin), 1]))
  range_y <- range(c(true[-c(1:burnin), 2], comparison[-c(1:burnin), 2]))
  xx <- seq(range_x[1], range_x[2], by = my_bw_x)
  yy <- seq(range_y[1], range_y[2], by = my_bw_y)
  true.pointidxs <- matrix(c(findInterval(true[-c(1:burnin), 1], xx),
                             findInterval(true[-c(1:burnin), 2], yy)), ncol = 2)
  comparison.pointidxs <- matrix(c(findInterval(comparison[-c(1:burnin), 1], xx),
                                   findInterval(comparison[-c(1:burnin), 2], yy)), ncol = 2)
  # Count the frequencies in the corresponding cells:
  square.mat.dims <- max(length(xx), length(yy))
  frequencies.true <- frequencies.comparison <- matrix(0, ncol = square.mat.dims, nrow = square.mat.dims)
  for (i in 1:dim(true.pointidxs)[1]) {
    frequencies.true[true.pointidxs[i, 1], true.pointidxs[i, 2]] <-
      frequencies.true[true.pointidxs[i, 1], true.pointidxs[i, 2]] + 1
    frequencies.comparison[comparison.pointidxs[i, 1], comparison.pointidxs[i, 2]] <-
      frequencies.comparison[comparison.pointidxs[i, 1], comparison.pointidxs[i, 2]] + 1
  } # End for
  # Normalize the frequency matrices:
  frequencies.true <- frequencies.true / dim(true.pointidxs)[1]
  frequencies.comparison <- frequencies.comparison / dim(comparison.pointidxs)[1]
  TVD <- 0.5 * sum(abs(frequencies.comparison - frequencies.true))
  return(TVD)
} # End function
TVD.2D <- Total.Variation.Distance.2D(true = ground.truth.2D, comparison = comparison.2D,
                                      burnin = 10000, window.size = 0.05)
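For what it's worth, the same binning idea carries over to three columns if you cut each dimension with a common set of breakpoints and count occupancy in a 3-D contingency table. A rough sketch (p and q are the two 30,000 x 3 draw matrices; nbins is a hypothetical tuning choice):
tvd_3d <- function(p, q, nbins = 20) {
  # common breakpoints per dimension so both samples share the same grid
  breaks <- lapply(1:3, function(j)
    seq(min(p[, j], q[, j]), max(p[, j], q[, j]), length.out = nbins + 1))
  bin <- function(m) table(cut(m[, 1], breaks[[1]], include.lowest = TRUE),
                           cut(m[, 2], breaks[[2]], include.lowest = TRUE),
                           cut(m[, 3], breaks[[3]], include.lowest = TRUE))
  fp <- bin(p) / nrow(p)   # relative frequency in each 3-D cell
  fq <- bin(q) / nrow(q)
  0.5 * sum(abs(fp - fq))
}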

Derivative of a set of points

So I know you can find the derivative of something like "x^3 - 6*x^2" by doing D(expression(x^3 - 6*x^2), 'x'), but what if I need to find the maximum of the first derivative of a list of values such as:
value <- c(610,618,627,632,628,634,634,628,634,642,637,643,653,666,684,717,787,923,1197,1716,2638,4077,5461,7007,8561,9994,11278,12382,13382,14252)
These values are the y coordinates, and the x coordinate starts at 1 and increments by 1, i.e. the first point is (1, 610), the second is (2, 618), etc. Thanks!
Consider using the package numDeriv from CRAN. It has a function grad that computes the derivative of a function at a point. Example:
f = function(x) x^3 - 6*x^2
library(numDeriv)
grad(f, 1) #derivative of f at x=1
To solve your problem with a list of values, use a for loop:
xval <- c(YOUR VALUES HERE)
xval.derivatives <- c() # empty vector to hold the results
for(i in 1:length(xval)) xval.derivatives[i] <- grad(f,xval[i])
The gradient function from the pracma package calculates the derivative from a vector of values.
library(pracma)
value <- c(610,618,627,632,628,634,634,628,634,642,637,643,653,666,684,717,787,923,1197,1716,2638,4077,5461,7007,8561,9994,11278,12382,13382,14252)
value_prime <- pracma::gradient(value, h1 = 1)
plot(value_prime)
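Since the question asks for the maximum of the first derivative, with unit spacing that is just:
which.max(value_prime)   # x position at which the first derivative peaks
max(value_prime)         # the corresponding slope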
Alternatively, fit a spline.
spl <- smooth.spline(1:length(value), y=value)
pred <- predict(spl)
pred.prime <- predict(spl, deriv=1)
plot(pred.prime, type = 'b')
If you are interested in higher derivatives, check the pspline package.
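For example (a sketch; I believe pspline exposes sm.spline() and its predict method takes an nderiv argument, but double-check the help page):
library(pspline)
fit <- sm.spline(1:length(value), value)
d2 <- predict(fit, 1:length(value), nderiv = 2)  # estimated second derivative
plot(1:length(value), d2, type = "b")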
