What's an easy way to find the Euclidean distance between two n-dimensional vectors in Julia?
Here is a simple way:
using LinearAlgebra # provides norm in Julia >= 0.7
n = 10
x = rand(n)
y = rand(n)
d = norm(x - y) # the Euclidean (L2) distance
For the Manhattan/taxicab/L1 distance, use norm(x - y, 1).
This is easily done thanks to the lovely Distances package:
import Pkg; Pkg.add("Distances") # if you don't have it
using Distances
one7d = rand(7)
two7d = rand(7)
dist = euclidean(one7d,two7d)
Also, if you have, say, two matrices of 9-d column vectors, you can get the distance between each corresponding pair using colwise:
thousand9d1 = rand(9,1000)
thousand9d2 = rand(9,1000)
dists = colwise(Euclidean(), thousand9d1, thousand9d2)
#returns: 1000-element Array{Float64,1}
You can also compare against a single vector, e.g. the origin (if you want the magnitude of each column vector):
origin9 = zeros(9)
mags = colwise(Euclidean(), thousand9d1, origin9)
#returns: 1000-element Array{Float64,1}
Other distances are also available:
Squared Euclidean
Cityblock
Chebyshev
Minkowski
Hamming
Cosine
Correlation
Chi-square
Kullback-Leibler divergence
Jensen-Shannon divergence
Mahalanobis
Squared Mahalanobis
Bhattacharyya
Hellinger
More details are available at the package's GitHub page: https://github.com/JuliaStats/Distances.jl
I have a 500x500 adjacency matrix of 1s and 0s, and I need to calculate the PageRank for each page. I have code here, where R is the matrix and T=0.15 is a constant:
n = ncol(R)
B = matrix(1/n, n, n) # the teleportation matrix
A = 0.85 * R + 0.15 * B
ranks = eigen(A)$vectors[1] # my PageRanks
print(ranks)
[1] -0.5317519+0i
I don't have much experience with R, but I assume that the given output is a general PageRank, and I need a PageRank for each page.
Is there a way to construct a table of PageRanks with relation to the matrix? I didn't find anything related to my particular case on the web.
A few points:
(1) You need to convert the binary adjacency matrix (R in your case) to a column-stochastic transition matrix to start with (representing the probability of transitions between the pages).
(2) A needs to remain column-stochastic as well; only then will the dominant eigenvector, corresponding to the eigenvalue 1, be the PageRank vector.
(3) To find the first eigenvector of the matrix A, you need to use eigen(A)$vectors[,1] (note the [,1]: the first column, not the first element as in your code).
Example with a small 5x5 adjacency matrix R:
set.seed(12345)
R = matrix(sample(0:1, 25, replace=TRUE), nrow=5) # random binary adjacency matrix
R = t(t(R) / rowSums(t(R))) # convert the adjacency matrix R to a column-stochastic transition matrix
n = ncol(R)
B = matrix(1/n, n, n) # the teleportation matrix
A = 0.85 * R + 0.15 * B
A <- t(t(A) / rowSums(t(A))) # make A column-stochastic
ranks = eigen(A)$vectors[,1] # my PageRanks
print(ranks)
# [1] 0.05564937 0.05564937 0.95364105 0.14304616 0.25280990
print(ranks / sum(ranks)) # normalized ranks
# [1] 0.03809524 0.03809524 0.65282295 0.09792344 0.17306313
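As a sanity check, the same PageRank vector can be obtained by power iteration instead of an eigendecomposition; here is a minimal sketch (the function pagerank.power is my own illustration, not from any package):
pagerank.power <- function(A, tol = 1e-10, max.iter = 1000) {
  # A must be a column-stochastic matrix
  n <- ncol(A)
  p <- rep(1/n, n)              # start from the uniform distribution
  for (iter in 1:max.iter) {
    p.new <- as.vector(A %*% p) # one step of the Markov chain
    if (sum(abs(p.new - p)) < tol) break
    p <- p.new
  }
  p / sum(p)                    # normalize so the ranks sum to 1
}
pagerank.power(A) # should agree with ranks / sum(ranks) above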
I'm currently trying to recreate this Matlab function in R:
function X = uniform_sphere_points(n,d)
% X = uniform_sphere_points(n,d)
%
%function generates n points uniformly within the unit sphere in d dimensions
z= randn(n,d);
r1 = sqrt(sum(z.^2,2));
X=z./repmat(r1,1,d);
r=rand(n,1).^(1/d);
X = X.*repmat(r,1,d);
Regarding the right matrix division, I installed the pracma package. My R code right now is:
uniform_sphere_points <- function(n,d){
  # function generates n points uniformly within the unit sphere in d dimensions
  z = rnorm(n, d)
  r1 = sqrt(sum(z^2,2))
  X = mrdivide(z, repmat(r1,1,d))
  r = rnorm(1)^(1/d)
  X = X * matrix(r,1,d)
  return(X)
}
But it is not really working, since I always end up with a non-conformable arrays error in R.
This operation for sampling n random points from the d-dimensional unit sphere could be stated in words as:
Construct an n x d matrix with entries drawn from the standard normal distribution
Normalize each row so it has (2-norm) magnitude 1
For each row, compute a random value by taking a draw from the uniform distribution (between 0 and 1) and raise that value to the 1/d power. Multiply all elements in the row by that value.
The following R code does these operations:
unif.samp <- function(n, d) {
  z <- matrix(rnorm(n*d), nrow=n, ncol=d)
  z * (runif(n)^(1/d) / sqrt(rowSums(z^2)))
}
Note that in the second line of code I have taken advantage of the fact that multiplying an n x d matrix in R by a vector of length n will multiply each row by the corresponding value in that vector. This saves us the work of using repmat to construct matrices of exactly the same size as our original matrix for these sorts of row-specific operations.
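As a quick check (my own snippet, not part of the original answer), you can confirm that every sampled point lies inside the unit sphere and that the radii fill the interior rather than piling up on the surface:
pts <- unif.samp(10000, 3)
radii <- sqrt(rowSums(pts^2))
stopifnot(all(radii <= 1)) # every point is inside the unit sphere
hist(radii) # for d = 3 the radii should follow the density 3*r^2 on [0, 1]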
I'm looking for a well-optimized function that accepts an n x n distance matrix and returns an n x k matrix with the indices of the k nearest neighbors of the ith datapoint in the ith row.
I find a gazillion different R packages that let you do KNN, but they all seem to include the distance computations along with the sorting algorithm within the same function. In particular, for most routines the main argument is the original data matrix, not a distance matrix. In my case, I'm using a nonstandard distance on mixed variable types, so I need to separate the sorting problem from the distance computations.
This is not exactly a daunting problem -- I obviously could just use the order function inside a loop to get what I want (see my solution below), but this is far from optimal. For example, the sort function with partial = 1:k when k is small (less than 11) goes much faster, but unfortunately returns only sorted values rather than the desired indices.
Try the FastKNN CRAN package (although it is not well documented). It offers a k.nearest.neighbors function to which an arbitrary distance matrix can be given. Below is an example that computes the matrix you need.
library(FastKNN)
# arbitrary data
train <- matrix(sample(c("a","b","c"),12,replace=TRUE), ncol=2) # n x 2
n = dim(train)[1]
distMatrix <- matrix(runif(n^2,0,1),ncol=n) # n x n
# matrix of neighbours
k=3
nn = matrix(0,n,k) # n x k
for (i in 1:n)
  nn[i,] = k.nearest.neighbors(i, distMatrix, k = k)
Note: you can always search the CRAN packages-by-name list (Ctrl+F for 'knn') for related functions:
https://cran.r-project.org/web/packages/available_packages_by_name.html
For the record (I won't mark this as the answer), here is a quick-and-dirty solution. Suppose sd.dist is the special distance matrix. Suppose k.for.nn is the number of nearest neighbors.
n = nrow(sd.dist)
knn.mat = matrix(0, ncol = k.for.nn, nrow = n)
knd.mat = knn.mat
for(i in 1:n){
  knn.mat[i,] = order(sd.dist[i,])[1:k.for.nn]
  knd.mat[i,] = sd.dist[i,knn.mat[i,]]
}
Now knn.mat is the matrix with the indices of the k nearest neighbors in each row, and for convenience knd.mat stores the corresponding distances.
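Following up on the sort(..., partial = ...) observation in the question, one possible middle ground (my own sketch, not from any package) is to partially sort each row to find the k-th smallest distance and then run order only on the few surviving candidates:
knn.partial <- function(sd.dist, k) {
  n <- nrow(sd.dist)
  knn.mat <- matrix(0L, nrow = n, ncol = k)
  for (i in 1:n) {
    row <- sd.dist[i, ]
    kth <- sort(row, partial = k)[k]  # value of the k-th smallest distance
    cand <- which(row <= kth)         # at least k candidates (more only with ties)
    knn.mat[i, ] <- cand[order(row[cand])][1:k]
  }
  knn.mat
}
This restricts the expensive full ordering to a handful of candidate columns per row while still returning indices rather than sorted values.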
I have two vectors x and w; w is a numeric vector of weights of the same length as x.
How can we get the weighted average of neighboring elements in x (the weighted average of the first and second elements, then of the second and third elements, and so on)? For example, these vectors are as follows:
x = c(0.0001560653, 0.0001591889, 0.0001599698, 0.0001607507, 0.0001623125,
0.0001685597, 0.0002793819, 0.0006336307, 0.0092017241, 0.0092079042,
0.0266525118, 0.0266889564, 0.0454923285, 0.0455676525, 0.0457005450)
w = c(2.886814e+03, 1.565955e+04, 9.255762e-02, 7.353589e+02, 1.568933e+03,
5.108046e+05, 6.942338e+05, 4.912165e+04, 9.257674e+00, 3.609918e+02,
8.090436e-01, 1.072975e+00, 1.359145e+00, 9.828314e+00, 9.455688e+01)
You can compute each pairwise weighted mean directly with sapply over consecutive index pairs:
sapply(1:(length(x)-1), function(i) weighted.mean(x[i:(i+1)], w[i:(i+1)]))
A functional programming approach (it will be slower than @David Robinson's answer above):
# lots of Map / functional programming
mapply(weighted.mean,
       x = Map(c, head(x, -1), tail(x, -1)),
       w = Map(c, head(w, -1), tail(w, -1)))
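For completeness (my own addition, not from either answer), the same result can be computed fully vectorized, avoiding a function call per pair:
# weighted mean of each consecutive pair, written out directly
(head(x, -1) * head(w, -1) + tail(x, -1) * tail(w, -1)) / (head(w, -1) + tail(w, -1))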
I'm using R to perform a hierarchical clustering. As a first approach I used hclust and performed the following steps:
I imported the distance matrix
I used the as.dist function to transform it in a dist object
I ran hclust on the dist object
Here's the R code:
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, "ward")
At this point I would like to do something similar with the function pvclust; however, I cannot, because it's not possible to pass a precomputed dist object. How can I proceed, considering that I'm using a distance not available among those provided by R's dist function?
I've tested Vincent's suggestion; you can do the following (my data set is a dissimilarity matrix):
# Import your data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
# Compute the eigenvalues
x <- cmdscale(d,1,eig=T)
# Plot the eigenvalues and choose the correct number of dimensions (eigenvalues close to 0)
plot(x$eig,
     type="h", lwd=5, las=1,
     xlab="Number of dimensions",
     ylab="Eigenvalues")
# Recover the coordinates that give the same distance matrix with the correct number of dimensions
x <- cmdscale(d,nb_dimensions)
# As mentioned by Stéphane, pvclust() clusters columns
pvclust(t(x))
If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.
# Sample distance matrix
n <- 100
k <- 1000
d <- dist( matrix( rnorm(k*n), nc=k ), method="manhattan" )
# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot( sum(abs(dist(x) - d)) < 1e-6 )
# You can then indifferently use x or d
r1 <- hclust(d)
r2 <- hclust(dist(x)) # identical to r1
library(pvclust)
r3 <- pvclust(x)
If the dataset is large, you may have to check how pvclust is implemented.
It's not clear to me whether you only have a distance matrix, or whether you computed it beforehand. In the former case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints on another question on CrossValidated). In the latter case, the authors of pvclust provide an example of how to use a custom distance function, although that means you will have to install their "unofficial version".
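For what it's worth, recent CRAN releases of pvclust (2.x) document that method.dist may also be a user-supplied function returning a dist object, which would avoid the workaround entirely; a minimal sketch under that assumption (the manhattan choice is just a stand-in for your own distance, and your.data is a hypothetical data matrix):
library(pvclust)
# pvclust clusters the columns of its input, so a custom distance function
# receives the data matrix and should return a dist object over its columns.
my.dist <- function(x) dist(t(x), method = "manhattan")
# result <- pvclust(your.data, method.dist = my.dist)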