Coefficient of Euclidean Distance - similarity

I have been trying to calculate the correlation coefficient (say r) and the Euclidean distance (say d) between two random variables X and Y. It is known that -1 <= r <= 1, whereas d >= 0. To compare these similarity metrics (mostly for visualization purposes), I first want to compute a coefficient for d so that it lies between 0 and 1, or between -1 and 1 like r. One way to scale d is to divide it by its maximum, i.e. d* = d/max(d). However, max(d) is not a global value: if someone uses different data points for X and Y, the result is no longer comparable to the first one. Therefore, I'm asking this community to suggest a better way of scaling the Euclidean distance so that it ranges in [0, 1] or [-1, 1].
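For concreteness, a minimal R sketch of the scaling described above (the data are made up; dist() is used here to get Euclidean distances between several variables):
set.seed(1)
X <- matrix(rnorm(100), ncol = 5)   # five made-up variables, 20 observations each
d <- as.vector(dist(t(X)))          # Euclidean distances between the variables
d_star <- d / max(d)                # now in [0, 1], but only relative to this max(d)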
I appreciate your cooperation in advance.
Alemu


Matrix dimensions do not match in regression formula

I'm trying to calculate this regression formula, but I have a problem with the dimension calculation; the dimensions do not work out.
Where:
X - a matrix with dimensions 200x20, n = 200 samples, p = 20 predictors,
y - a matrix with dimensions 200x1,
beta^(k) - a sequence of coefficient vectors, each with dimensions 20x1, for k = 1, 2, 3, ...,
X^T - dimensions 20x200,
j - an index from 1...p, so from 1...20.
The problem is when I calculate y - X beta^(k-1).
For example, for k = 20, k - 1 = 19, I have y - X beta^(19), and if beta^(19) is read as a single value the dimensions do not match for the subtraction: 200x1 - (200x20)(1x1) = 200x1 - 200x20 will not work.
If I take the whole beta vector then it is correct. Does beta^(19) mean to take the 19th value of beta and multiply it by the matrix X?
Source of the formula:
You should be using the entire beta vector at each stage of the calculation.
(Tibshirani has been a bit permissive with his use of notation, perhaps...)
The k is just a counter for which step of the algorithm we are on. Right at the start (k = 0, or "step 0") we initialise the entire beta vector to have all elements equal to zero: beta^(0) = 0.
At each step of the algorithm (steps k = 1, 2, 3, and so on) we use our previous estimate of the vector beta (beta^(k-1), calculated in step k - 1) to calculate a new, improved estimate of the vector beta (beta^(k)). The superscript is not an index into the vector; rather, it is a label telling us at which stage of the algorithm that beta vector was produced.
I hope this makes sense. The important point is that each of the beta^(k) values is a different 20x1 vector.
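To make the step-counter idea concrete, here is a minimal R sketch (the update rule is an assumed gradient-style step, since the exact formula was not preserved above); the point is that every step maps one full 20x1 beta vector to another:
set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
beta <- rep(0, p)                  # beta^(0): the whole vector starts at zero
step_size <- 1e-4                  # illustrative learning rate (an assumption)
for (k in 1:100) {
  r <- y - X %*% beta              # the residual uses the ENTIRE beta^(k-1)
  beta <- beta + step_size * t(X) %*% r   # beta^(k): a new full 20x1 vector
}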

Points uniformly distributed on unit disk (2D)

I am trying to generate 10,000 points from the uniform distribution on the unit disk and plot these points.
The method I am using has three steps. The first step is generating the magnitude of the point, x. This magnitude has cdf F(x) = x^2, with min(x) = 0 and max(x) = 1. The second step involves generating a 2-dimensional vector (which I will call y) from the multivariate normal distribution with mu the zero vector and sigma the 2x2 identity matrix, i.e. MVN(0, I). Last, I normalize the vector y to have length x. I have tried to code the solution in R but I do not think my answer is correct. I would really appreciate being pointed in the right direction.
library(MASS)   # provides mvrnorm
u = runif(10000)
x = u^2
y = mvrnorm(10000, mu = rep(0, 2), Sigma = diag(2))
y_norm = (x * y) / sqrt(sum(y^2))
plot(y_norm, asp = 1)
I used the MASS package for mvrnorm. Also I have included the plot that I ended up with:
You need to compute the length of each row of your y matrix; as written, you take the square root of the sum of all the numbers in y, which just scales your whole multivariate normal sample by a single constant. Also, you need x to be sqrt(u) rather than u^2. This code normalises each row by its length, uses sqrt(u) for the magnitudes, and the result looks nice and uniform:
plot(sqrt(u)*y/sqrt(y[,1]^2+y[,2]^2))
There are better ways of making uniform points on a disc, unless this is just an exercise to do it this way...
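One standard alternative (a sketch added here, not part of the original answer) draws a radius and an angle directly; the sqrt on the radius compensates for the area of the disc growing with r^2:
n <- 10000
r <- sqrt(runif(n))              # sqrt makes the density uniform in area
theta <- runif(n, 0, 2 * pi)     # uniform angle
plot(r * cos(theta), r * sin(theta), asp = 1)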

Calculate weighted mean point from array of 3D points

I am writing the program in Cocoa but I think that the solution must be quite universal.
I have a set of points represented by 3D vectors. Each point has a weight assigned to it. The weight is in the range from 0 to 1. The sum of all weights isn't equal to 1.
How should the weighted mean point be calculated from such a set?
Either a programmatic or a purely mathematical solution would be helpful.
Of course, if Cocoa has some specific tools for solving this task, I would very much appreciate that information.
Simply sum all the vectors scaled by their weights, then divide by the sum of all the weights. This has the same effect as first normalizing the weights to sum to 1.
Pseudo-code:
sum = [0, 0, 0]
totalWeights = 0
for each point p with associated weight w:
    sum += p * w
    totalWeights += w
mean = sum / totalWeights
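The same computation in R (a sketch for illustration; the asker's environment is Cocoa, so the names here are made up):
pts <- matrix(runif(15), ncol = 3)        # five example 3D points, one per row
w <- runif(5)                             # weights in [0, 1]; need not sum to 1
mean_point <- colSums(pts * w) / sum(w)   # weighted mean, a length-3 vector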

How can I get a more precise value from a list of weighted numbers?

A is a list of increasing fixed values (frequencies). It doesn't step evenly but the values never change.
A = 5, 10, 17, 23, 30
Each value in A is weighted by the corresponding value in list B (volume).
B = 2.2, 3.5, 4.4, 3.2, 1.1
I want to calculate the loudest frequency (A). The problem is that the loudest frequency may be 14 but I can't tell from this data set. How can I calculate, based on list B, what the loudest frequency might be in list A?
Here's a rough outline of a solution: I haven't nutted out all the maths for you, but I hope it helps.
Approximate the frequency amplitude using interpolatory splines.
This will give you the function between each adjacent pair of frequency sample points as a sum of basis functions for the frequency values surrounding the pair.
This means you have a function f(x) defined on each interval.
f(x) = A phi_0(x) + B phi_1(x) + C phi_2(x) + D phi_3(x)
At a maximum,
0 = f'(x) = A phi_0'(x) + B phi_1'(x) + C phi_2'(x) + D phi_3'(x)
If you're using cubic spline interpolation, the derivative will be quadratic in x, so you can obtain up to two potential extrema for each interval.
Scan through all the intervals and calculate those extrema. Check whether each one falls inside its interval; if it doesn't, it's not really a potential extremum. You now have a list of all the potential internal maxima. Add to this list the values at each node. The maximum of this list will be the maximum value of the interpolatory spline.
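In R (an illustration; the outline above is language-agnostic), one shortcut is to build the cubic interpolant with splinefun and scan a fine grid for its peak, rather than solving the per-interval quadratics by hand:
A <- c(5, 10, 17, 23, 30)
B <- c(2.2, 3.5, 4.4, 3.2, 1.1)
f <- splinefun(A, B)                        # cubic interpolating spline
x <- seq(min(A), max(A), length.out = 1000)
x[which.max(f(x))]                          # frequency where the interpolant peaks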
You have not been terribly clear here (IMHO). I don't know what it means to "weight" values in A by B. Do we assume we are to treat B as a function of A? Therefore, we are looking for the frequency (A) such that B attains its maximum value, AS A FUNCTION OF A?
If so, this is simply a problem of interpolation, then maximizing the interpolant. Depending on what tools you have available to you, you might do a spline interpolation, as shown in this figure. Then it would be merely a problem of finding the maximum value of that spline.
This spline model suggests the maximum value is Bmax=4.4132, which occurs at A=16.341.
Alternatively, one might simply fit an interpolating polynomial through the points. Your data are not so noisy that a 4th-degree polynomial will be ill-posed. (Had you more points, a high-order polynomial would be a terrible idea; then you might use a piecewise Lagrange interpolant.) Done in MATLAB:
>> P = polyfit(A,B,4)
P =
6.6992e-05 -0.0044803 0.084249 -0.34529 2.3384
I'll plot the polynomial itself.
>> ezplot(@(x) polyval(P,x),[5,30])
We can find the maximum value by looking for a root (zero value) of the derivative function. Since the derivative is a cubic polynomial, there are three roots. Only one of them is of interest.
>> roots(polyder(P))
ans =
31.489
16.133
2.5365
The root of interest is at 16.133, which is consistent with the prediction from the interpolating spline.
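For readers without MATLAB, a rough R equivalent of the polynomial steps above (a sketch; the roots should come out close to the MATLAB values):
A <- c(5, 10, 17, 23, 30)
B <- c(2.2, 3.5, 4.4, 3.2, 1.1)
co <- coef(lm(B ~ poly(A, 4, raw = TRUE)))   # c0..c4, intercept first
dco <- co[-1] * 1:4                          # derivative coefficients: c1, 2*c2, 3*c3, 4*c4
crit <- polyroot(dco)                        # roots of the cubic derivative
Re(crit[abs(Im(crit)) < 1e-8])               # keep the real critical points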

How to calculate all pairwise distances in two dimensions

Say I have data concerning the position of animals on a 2D plane (as determined by video monitoring from a camera directly overhead), for example a matrix with 15 rows (one for each animal) and 2 columns (x position and y position):
animal.ids<-letters[1:15]
xpos<-runif(15) # x coordinates
ypos<-runif(15) # y coordinates
raw.data.t1<-data.frame(xpos, ypos)
rownames(raw.data.t1) = animal.ids
I want to calculate all the pairwise distances between animals. That is, get the distance from animal a (row 1) to the animals in row 2, row 3, ..., row 15, and then repeat that for all rows, avoiding redundant distance calculations. The desired output of a function that does this would be the mean of all the pairwise distances. I should clarify that I mean the simple linear distance, from the formula d <- sqrt(((x1-x2)^2)+((y1-y2)^2)). Any help would be greatly appreciated.
Furthermore, how could this be extended to a similar matrix with an arbitrarily large even number of columns (every two columns representing the x and y positions at a given time point)? The goal here would be to calculate the mean pairwise distance for every two columns and output a table with each time point and its corresponding mean pairwise distance. Here is an example of the data structure with 3 time points:
xpos1<-runif(15)
ypos1<-runif(15)
xpos2<-runif(15)
ypos2<-runif(15)
xpos3<-runif(15)
ypos3<-runif(15)
pos.data<-cbind(xpos1, ypos1, xpos2, ypos2, xpos3, ypos3)
rownames(pos.data) = letters[1:15]
The aptly named dist() will do this:
x <- matrix(rnorm(100), nrow=5)
dist(x)
          1        2        3        4
2  7.734978
3  7.823720 5.376545
4  8.665365 5.429437 5.971924
5  7.105536 5.922752 5.134960 6.677726
See ?dist for more details.
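For the multi-time-point case, a hedged sketch building on dist() (mean_pairwise is a helper name introduced here; pos.data is the example matrix from the question):
mean_pairwise <- function(xy) mean(dist(xy))   # mean of all pairwise distances
timepoints <- ncol(pos.data) / 2               # columns come in (x, y) pairs
sapply(seq_len(timepoints), function(i) {
  mean_pairwise(pos.data[, c(2 * i - 1, 2 * i)])
})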
Why compute d <- sqrt(((x1-x2)^2)+((y1-y2)^2)) if you only need to compare distances?
Use d2 <- ((x1-x2)^2)+((y1-y2)^2) instead. Skipping the square root will cost you much less.
