Similarity distance measures - vector

Vectors like this
v1 = {0 0 0 1 1 0 0 1 0 1 1}
v2 = {0 1 1 1 1 1 0 1 0 1 0}
v3 = {0 0 0 0 0 0 0 0 0 0 1}
Need to calculate similarity between them. The Hamming distance between v1 and v2 is 4, and between v1 and v3 it is also 4. But because I am interested in the groups of '1' that occur together, to me v2 is far more similar to v1 than v3 is.
Is there any distance metrics that can capture this in the data?
The data represent the occupancy of a house over time, which is why this matters: '1' means occupied, '0' means unoccupied.
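(For reference, not part of the original question: the Hamming distances quoted above can be checked with a couple of lines of Python.)

v1 = [0,0,0,1,1,0,0,1,0,1,1]
v2 = [0,1,1,1,1,1,0,1,0,1,0]
v3 = [0,0,0,0,0,0,0,0,0,0,1]

def hamming(a, b):
    # number of positions where the two vectors differ
    return sum(x != y for x, y in zip(a, b))

print(hamming(v1, v2), hamming(v1, v3))   # -> 4 4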

It sounds like you need the cosine similarity measure:
similarity = cos(v1, v2) = v1 * v2 / (|v1| |v2|)
where v1 * v2 is the dot product of v1 and v2:
v1 * v2 = v1[1]*v2[1] + v1[2]*v2[2] + ... + v1[n]*v2[n]
Essentially, the dot product shows how many elements in both vectors have a 1 at the same position: if v1[k] == 1 and v2[k] == 1, the final sum (and thus the similarity) increases; otherwise it is unchanged.
You can use the dot product itself, but sometimes you want the final similarity to be normalized, e.g. to lie between 0 and 1. In that case you can divide the dot product of v1 and v2 by their lengths, |v1| and |v2|. The length of a vector is the square root of its dot product with itself:
|v| = sqrt(v[1]*v[1] + v[2]*v[2] + ... + v[n]*v[n])
Having all of these, it's easy to implement the cosine similarity as follows (example in Python):
from math import sqrt

def dot(v1, v2):
    return sum(x*y for x, y in zip(v1, v2))

def length(v):
    return sqrt(dot(v, v))

def sim(v1, v2):
    return dot(v1, v2) / (length(v1) * length(v2))
Note that I described similarity (how close two vectors are to each other), not distance (how far apart they are). If you need a distance, you can convert it, e.g. dist = 1 - sim.
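As a quick sanity check (my addition, not part of the original answer), applying sim to the vectors from the question ranks v2 much closer to v1 than v3, which is exactly the behaviour the question asks for:

v1 = [0,0,0,1,1,0,0,1,0,1,1]
v2 = [0,1,1,1,1,1,0,1,0,1,0]
v3 = [0,0,0,0,0,0,0,0,0,0,1]

print(sim(v1, v2))   # 4 / sqrt(5 * 7) ~ 0.676
print(sim(v1, v3))   # 1 / sqrt(5 * 1) ~ 0.447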

There are literally hundreds of distance functions, including distance measures for sets, such as Dice and Jaccard.
You may want to get the book "Dictionary of Distance Functions"; it's pretty good.

Case 1: IF the position of the ones in the series is relevant, THEN:
I recommend the Dynamic Time Warping (DTW) distance. It has proven incredibly useful for time-series data.
To check whether it can be applied to your problem, I used the code presented here: https://jeremykun.com/2012/07/25/dynamic-time-warping/
d13 = dynamicTimeWarp(v1,v3)
d12 = dynamicTimeWarp(v1,v2)
d23 = dynamicTimeWarp(v2,v3)
d23,d12,d13
(3, 1, 3)
As you can see, d12 is the lowest, therefore v1 and v2 are the most similar. Further information on DTW can be found elsewhere on this forum, and for research papers I recommend anything by Eamonn Keogh.
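For readers who want to try this without the linked code, here is a minimal sketch of the classic DTW dynamic program in Python (my own illustration; the linked implementation may differ in details, but for the question's vectors it reproduces the distances quoted above):

def dtw(s, t):
    # dp[i][j] = cost of the best warping path aligning s[:i] with t[:j]
    n, m = len(s), len(t)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch s
                                  dp[i][j - 1],      # stretch t
                                  dp[i - 1][j - 1])  # advance both
    return dp[n][m]

With the vectors from the question this gives dtw(v1, v2) = 1 and dtw(v1, v3) = dtw(v2, v3) = 3.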
Case 2: IF the position of the ones is not relevant, THEN:
I simply agree with Deepu on taking the average as a feature.

I think you can simply take the average of the values in each vector. For example, v1 here has an average of 0.4545, v2 has an average of 0.6363, and v3 has an average of 0.0909. If the only possible values in the vector are 0 and 1, then vectors with equal or nearly equal averages will serve your purpose.

There is a web site introducing various types of vector similarity measures:
http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
I think it will help you decide which similarity measure to use.
Briefly summarizing the link above, there are five popular similarity measures between vectors:
Euclidean distance - the straight-line distance between the vectors
Cosine - the cosine of the angle (theta) between the vectors
Manhattan - the sum of the absolute differences of their Cartesian coordinates; for example, in a plane with p1 at (x1, y1) and p2 at (x2, y2), the Manhattan distance = |x1 - x2| + |y1 - y2|
Minkowski - a generalized metric that includes both Euclidean and Manhattan distance as special cases
Jaccard - set similarity: the size of the intersection of two sets divided by the size of their union
With the keywords above you can search for further explanations.
Hope it helps.
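Since Jaccard is mentioned a couple of times above, here is a tiny sketch of it for the question's binary vectors (my own illustration): treat the positions of the ones as sets and divide the size of the intersection by the size of the union.

def jaccard(a, b):
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union

v1 = [0,0,0,1,1,0,0,1,0,1,1]
v2 = [0,1,1,1,1,1,0,1,0,1,0]
v3 = [0,0,0,0,0,0,0,0,0,0,1]

print(jaccard(v1, v2))   # 4/8 = 0.5
print(jaccard(v1, v3))   # 1/5 = 0.2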

Related

How to understand the result of Discrete Fourier Transform under period finding?

I am learning how to use the Discrete Fourier Transform (DFT) to find the period of a^x mod N, where x is a positive integer, a is a prime number, and N is the product of two prime factors p and q.
For example, the period of 2^x mod 15 is 4:
>>> for x in range(8):
... print(2**x % 15)
...
Output: 1 2 4 8 1 2 4 8
^-- the next period
and the result of the DFT is as follows:
[figure cited from O'Reilly Programming Quantum Computers, chapter 12]
There are 4 spikes with 4-unit spacing, and I think this indicates that the period is 4.
But when N is 35 and the period is 12:
>>> for x in range(16):
... print(2**x % 35)
...
Output: 1 2 4 8 16 32 29 23 11 22 9 18 1 2 4 8
^-- the next period
In this case, there are 8 spikes greater than 100, whose locations are 0, 5, 6, 11, 32, 53, 58, 59, respectively.
Does the location sequence imply the magic number 12? And how should I understand the "12 evenly spaced spikes" in the right-hand graph?
[figure cited from O'Reilly Programming Quantum Computers, chapter 12]
See How to compute Discrete Fourier Transform? and all its sublinks, especially How do I obtain the frequencies of each value in an FFT?.
As you can see there, the i-th element of the DFT result (counting from 0 to n-1 inclusive) represents the frequency
f(i) = i * fsampling / n
The DFT uses only those sinusoidal frequencies. So if your signal contains a different one (even a slightly different frequency or shape), aliasing occurs.
An aliased sinusoid creates 2 frequencies in the DFT output, one at a higher and one at a lower frequency.
Any sharp edge is spread across many frequencies (usually a continuous spectrum, like your last example).
The f(0) term is not a frequency at all; it represents the DC offset.
On top of all this, if the input of your DFT is real-valued, then the DFT result is symmetric, meaning you can use only the first half of the result; the second half is just a mirror image (not counting f(0)). This makes sense, as you cannot represent frequencies above fsampling/2 in real-valued data.
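To make the bin-frequency relation concrete, here is a small sketch (my own illustration, assuming NumPy; the 16-sample window is my choice, not the book's register size). A signal that repeats every T samples inside an n-sample window only has energy in bins that are multiples of n/T, which is consistent with the 4-unit spike spacing described in the question when n = 16 and T = 4:

import numpy as np

# 16 samples of 2**x mod 15, which repeats with period T = 4
signal = [pow(2, x, 15) for x in range(16)]
spectrum = np.abs(np.fft.fft(signal))

# energy only in bins that are multiples of n/T = 16/4 = 4
print(np.nonzero(spectrum > 1e-9)[0])   # -> [ 0  4  8 12]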
Conclusion:
You cannot directly read off the frequency of your signal from the DFT, as there is an infinite number of ways such a signal can be composed. The DFT reconstructs the signal using sine waves, and your signal is definitely not a sine wave, so the result will not match what you expect.
Matching the DFT bin frequencies to your signal is done by choosing n correctly for the DFT; however, without knowing the frequency ahead of time you cannot do this...
It may be possible to compute a single sine wave's frequency from its 2 aliases, but your signal is not a sine wave, so that is not applicable in your case anyway.
I would use a different approach to determine the frequency of an integer-valued signal:
compute a histogram of the signal, i.e. count how many times each value occurs
test the possible frequencies
You could brute-force all possible periods of the signal and test whether consecutive periods are identical, but for big data this is not optimal...
We can use the histogram to speed this up. If you look at the counts cnt(ix) from the histogram of a periodic signal with frequency f and period T in data of size n, then the period of the signal should be a common divisor of all the counts:
T = n/f
k*f = GCD(all non-zero cnt[i])
where k divides the GCD result. However, if n is not an exact multiple of T, or the signal contains noise or slight deviations, this will not hold exactly. We can still estimate the GCD and test all frequencies around the estimate, which is still faster than brute force.
So for each count (not accounting for noise) the following should hold:
cnt(ix) = ~ n/(f*k)
k = { 1,2,3,4,...,n/f }
so:
f = ~ n/(cnt(ix)*k)
So if you have a signal like this:
1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
then the histogram would be cnt[] = {0,7,8,4,0,0,0,0,...} and n = 19, so computing f (in periods per n samples) for each used value leads to:
f(ix) = n/(cnt(ix)*k)
f(1) = 19/(7*k) = ~ 2.714/k
f(2) = 19/(8*k) = ~ 2.375/k
f(3) = 19/(4*k) = ~ 4.750/k
Now the real frequency should be a common divisor (CD) of these results, so taking the biggest and smallest counts rounded up and down (ignoring noise) leads to these options:
f = CD(2,4) = 2
f = CD(3,4) = none
f = CD(2,5) = none
f = CD(3,5) = none
So now test the frequency (luckily there is just one valid candidate in this case): 2 periods per 19 samples means T = ~9.5, so test it rounded both down and up...
signal(t+ 0)=1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
signal(t+ 9)=1,1,1,2,2,2,2,3,3,1 // check 9 elements
signal(t+10)=1,1,2,2,2,2,3,3,1,? // check 10 elements
As you can see, signal(t...t+9) == signal(t+9...t+9+9), meaning the period is T = 9.
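To make the period test concrete, here is a small brute-force sketch in Python (my own illustration, not part of the answer above); it confirms T = 9 for the example signal:

def smallest_period(signal):
    # smallest T such that signal[i] == signal[i + T] for every valid i
    n = len(signal)
    for T in range(1, n):
        if all(signal[i] == signal[i + T] for i in range(n - T)):
            return T
    return n   # no repetition found within the window

s = [1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1]
print(smallest_period(s))   # -> 9, matching the check above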

How to solve a function subject to a condition in R

I'm working on a paper for school.
I have to find the tangency portfolio using this formula, which I translated into R as follows (the basic tangency portfolio formula with matrix algebra):
nom <- solve(Mat) %*% (ER - RF)
denom <- Ones %*% nom
w <- nom %*% solve(denom)
This formula also gives me negative weights (short selling), and I would like to add a constraint that only allows weights between 0 and 1, summing to 1.
Who can help me with this?
Example:
If I run the code as it is now, say with 3 assets, I get negative weights for some assets (e.g. c(0.20, -0.40, 0.80)), which still sum to 1 (in this case, short selling would be enabled). This is the tangency portfolio that maximizes the Sharpe ratio, given the expected returns and variances. What I'd love to have is the tangency portfolio without short selling allowed; in the example, the weights would be something like c(0.18, 0.05, 0.72). It would be incorrect to simply replace the negative numbers with 0 and the values above 1 with 1, as the sum of all the weights should still be one.
As mentioned by #Shree, an example would help to understand more precisely what you're after.
From what I understood, you want w to be bounded between [0,1]? Off the top of my head, you could either shift and scale w
## Shifting
shift_w <- (w-min(w))
## Scaling
shift_w/max(shift_w)
Or, more brutally, replace the values <0 or >1 by a given value or function
## What to replace the value with when negative
replace_negative <- 0
## What to replace the value with when superior to 1
replace_one <- 1
## Making sure w is bounded between 0 and 1
ifelse(w < 0, replace_negative, ifelse(w > 1, replace_one, w))
Note that replace_negative and replace_one can also be a function f(w) if you want something more complex.
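For completeness (my own addition, not part of the answer above, and sketched in Python/SciPy rather than R, purely as an illustration): rather than rescaling or clamping the unconstrained weights, the no-short-sale tangency portfolio can be found by maximizing the Sharpe ratio directly under the constraints with a general-purpose optimizer. The inputs er, cov and rf below are made-up placeholders.

import numpy as np
from scipy.optimize import minimize

er  = np.array([0.08, 0.12, 0.10])      # expected returns (made-up)
cov = np.array([[0.10, 0.02, 0.04],     # covariance matrix (made-up)
                [0.02, 0.08, 0.03],
                [0.04, 0.03, 0.09]])
rf  = 0.02                              # risk-free rate (made-up)

def neg_sharpe(w):
    # negative Sharpe ratio, because minimize() minimizes
    return -(w @ er - rf) / np.sqrt(w @ cov @ w)

n = len(er)
res = minimize(neg_sharpe, x0=np.full(n, 1.0 / n),
               bounds=[(0.0, 1.0)] * n,                               # no short selling
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
print(res.x)   # long-only weights that sum to 1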

Need to calculate the percentage of distribution

I have a set of numbers for a given set of attributes:
red = 4
blue = 0
orange = 2
purple = 1
I need to calculate the distribution percentage. Meaning, how diverse is the selection? Is it 20% diverse? Is it 100% diverse (meaning an even distribution, say 4,4,4,4)?
I'm trying to create a sexy percentage that approaches 100% the closer the individual values are to being equal, and drops the more lopsided they get.
Has anyone done this?
Here is my PHP conversion of the example below. For some reason it's not producing 1.0 with a 4,4,4,4 example.
$arrayChoices = array(4,4,4,4);
foreach($arrayChoices as $p)
$sum += $p;
print "sum: ".$sum."<br>";
$pArray = array();
foreach($arrayChoices as $rec)
{
print "p vector value: ".$rec." ".$rec / $sum."\n<br>";
array_push($pArray,$rec / $sum);
}
$total = 0;
foreach($pArray as $p)
if($p > 0)
$total = $total - $p*log($p,2);
print "total = $total <br>";
print round($total / log(count($pArray),2) *100);
Thanks in advance!
A simple, if rather naive, scheme is to sum the absolute differences between your observations and a perfectly uniform distribution
red = abs(4 - 7/4) = 9/4
blue = abs(0 - 7/4) = 7/4
orange = abs(2 - 7/4) = 1/4
purple = abs(1 - 7/4) = 3/4
for a total of 5.
A perfectly even spread will have a score of zero which you must map to 100%.
Assuming you have n items in c categories, a perfectly uneven spread will have a score of
(c-1)*n/c + 1*(n-n/c) = 2*(n-n/c)
which you should map to 0%. For a score d, you might use the linear transformation
100% * (1 - d / (2*(n-n/c)))
For your example this would result in
100% * (1 - 5 / (2*(7-7/4))) = 100% * (1 - 10/21) ~ 52%
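A short Python sketch of this absolute-difference scheme (my own illustration of the formulas above):

def diversity(counts):
    # sum of absolute deviations from a perfectly even spread,
    # mapped linearly onto [0, 1] via the worst-case score 2*(n - n/c)
    n, c = sum(counts), len(counts)
    d = sum(abs(x - n / c) for x in counts)
    return 1 - d / (2 * (n - n / c))

print(diversity([4, 0, 2, 1]))   # ~0.524, i.e. about 52%
print(diversity([4, 4, 4, 4]))   # 1.0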
Better yet (although more complicated) is the Kolmogorov–Smirnov statistic with which you can make mathematically rigorous statements about the probability that a set of observations have some given underlying probability distribution.
One possibility would be to base your measure on entropy. The uniform distribution has maximum entropy, so you could create a measure as follows:
1) Convert your vector of counts to P, a vector of proportions (probabilities).
2) Calculate the entropy function H(P) for your vector of probabilities P.
3) Calculate the entropy function H(U) for a vector of equal probabilities with the same length as P. (This turns out to be H(U) = -log(1.0 / length(P)), so you don't actually need to create U as a vector.)
4) Your diversity measure would be 100 * H(P) / H(U).
Any set of equal counts yields a diversity of 100. When I applied this to your (4, 0, 2, 1) case, the diversity was 68.94. Any vector with all but one element having counts of 0 has diversity 0.
ADDENDUM
Now with source code! I implemented this in Ruby.
def relative_entropy(v)
  # Sum all the values in the vector v; convert to float
  # so we won't have integer division below...
  sum = v.inject(:+).to_f
  # Divide each value in v by sum, store in the new array pvals
  pvals = v.map { |value| value / sum }
  # Build a running total from the entropy contribution of each p.
  # The contribution is zero if p is zero, in which case the total is unchanged.
  # Finally, scale by the entropy we would get if all proportions were equal.
  pvals.inject(0) { |total, p| p > 0 ? (total - p * Math.log2(p)) : total } / Math.log2(pvals.length)
end
# Scale these by 100 to turn into a percentage-like measure
relative_entropy([4,4,4,4]) # => 1.0
relative_entropy([4,0,2,1]) # => 0.6893917467430877
relative_entropy([16,0,0,0]) # => 0.0

R code iteration blues

I am very new to R and am trying to use it to visualize things. To make a long story short, I'm exploring a conjecture I have about the economic theory of public goods. (I'm in my mid 50s, so bear with me.)
As far as R is concerned, I need to create a matrix with two vectors: one with E(W)/max(W), and a second with stdev(W)/E(W). The trick is that the sample space of W, my random variable, keeps expanding by 1. To make this clearer, here is the probability distribution of W for the first 4 iterations:
W p
0 2/3
1 1/3
W p
0 3/6
1 2/6
2 1/6
W p
0 4/10
1 3/10
2 2/10
3 1/10
W p
0 5/15
1 4/15
2 3/15
3 2/15
4 1/15
...
I need to iterate this 20 times or so. Of course, I could do this manually by copying, pasting, and then adjusting simple code, but that would be too bulky and ugly looking, and I'm a bit concerned about, you know, elegance.
With good help from this community, I learned how to program R to generate the denominators of the probabilities:
R code iteration
I thought (foolishly) I could take it from there, but after a few hours of scratching my bald head I'm still stuck, not knowing how to get what I want. The problem is that I don't yet understand how to program less simple iterative procedures. :/
I'll appreciate any help, especially clues setting me on the right track.
Thanks in advance!
You're just dividing by the sum, and the sum of 1 to k is k*(k+1)/2. So...
R> k <- 3
R> k:1 / (k^2 + k) * 2
I assume you mean that you want a matrix of 20 or so rows, where each row holds the values of your two requested quantities for a distribution whose sample space has N values (W = 0, ..., N-1).
t(vapply(seq_len(20) + 1, function(N) {
  p  <- seq(N, 1) / (N * (N + 1) / 2)     # P(W = 0), ..., P(W = N - 1)
  w  <- seq_along(p) - 1                  # the values W can take: 0, 1, ..., N - 1
  EW <- sum(w * p)                        # E(W)
  c(EW / max(w), sqrt(sum((w - EW)^2 * p)) / EW)   # E(W)/max(W), sd(W)/E(W)
}, numeric(2)))

Calculate the length of a segment of a quadratic bezier

I use this algorithm to calculate the length of a quadratic bezier:
http://www.malczak.linuxpl.com/blog/quadratic-bezier-curve-length/
However, what I wish to do is calculate the length of the bezier from 0 to t, where 0 < t < 1.
Is there any way to modify the formula used in the link above to get the length of the first segment of a bezier curve?
Just to clarify, I'm not looking for the distance between q(0) and q(t) but the length of the arc that goes between these points.
(I don't wish to use adaptive subdivision to approximate the length.)
Since I was sure a similar closed-form solution would exist for the variable-t case, I extended the solution given in the link.
Starting from the arc-length integral in the link,
L(t) = Integral[0..t] sqrt(A*s^2 + B*s + C) ds
which we can write as
L(t) = sqrt(A) * Integral[0..t] sqrt(s^2 + 2*b*s + c) ds
where b = B/(2A) and c = C/A.
Then, substituting u = s + b, we get
L(t) = sqrt(A) * Integral[b..t+b] sqrt(u^2 + k) du
where k = c - b^2.
Now we can use the integral identity from the link,
Integral sqrt(u^2 + k) du = (1/2) * ( u*sqrt(u^2 + k) + k*ln|u + sqrt(u^2 + k)| ) + const,
to obtain
L(t) = (sqrt(A)/2) * ( u*sqrt(u^2 + k) - b*sqrt(b^2 + k) + k*ln| (u + sqrt(u^2 + k)) / (b + sqrt(b^2 + k)) | ), with u = t + b.
So, in summary, the required steps are:
Calculate A,B,C as in the original equation.
Calculate b = B/(2A) and c = C/A
Calculate u = t + b and k = c - b^2
Plug these values into the equation above.
[Edit by Spektre] I just managed to implement this in C++, so here is the code (it matches arc lengths obtained naively):
float x0,x1,x2,y0,y1,y2; // control points of Bezier curve
float get_l_analytic(float t) // get arclength from parameter t=<0,1>
{
float ax,ay,bx,by,A,B,C,b,c,u,k,L;
ax=x0-x1-x1+x2;
ay=y0-y1-y1+y2;
bx=x1+x1-x0-x0;
by=y1+y1-y0-y0;
A=4.0*((ax*ax)+(ay*ay));
B=4.0*((ax*bx)+(ay*by));
C= (bx*bx)+(by*by);
b=B/(2.0*A);
c=C/A;
u=t+b;
k=c-(b*b);
L=0.5*sqrt(A)*
(
(u*sqrt((u*u)+k))
-(b*sqrt((b*b)+k))
+(k*log(fabs((u+sqrt((u*u)+k))/(b+sqrt((b*b)+k)))))
);
return L;
}
There is still room for improvement, as some terms are computed more than once...
While there may be a closed-form expression, this is what I'd do:
Use De Casteljau's algorithm to split the bezier into the 0-to-t part and the remainder, then use the algorithm from the link to calculate the length of the first part.
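For reference, here is a small Python sketch of that split (my own illustration, with made-up helper names). One De Casteljau step at parameter t yields the control points of the 0-to-t piece, which you can then feed to the full-length formula from the link:

def lerp(p, q, t):
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

def split_quadratic(p0, p1, p2, t):
    a = lerp(p0, p1, t)   # first-level interpolations
    b = lerp(p1, p2, t)
    c = lerp(a, b, t)     # the point on the curve at parameter t
    # (p0, a, c) is the 0..t sub-curve, (c, b, p2) is the t..1 sub-curve
    return p0, a, c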
You just have to evaluate the integral not between 0 and 1 but between 0 and t. You can use the symbolic toolbox of your choice to do that if you're not into the math. For instance:
http://integrals.wolfram.com/index.jsp?expr=Sqrt[a*x*x%2Bb*x%2Bc]&random=false
Evaluate the result for x = t and x = 0 and subtract them.
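If a symbolic toolbox isn't handy, the same 0-to-t integral can also be evaluated numerically; here is a short sketch assuming SciPy (my own illustration, not part of the answer):

import numpy as np
from scipy.integrate import quad

def arc_length(p0, p1, p2, t):
    p0, p1, p2 = map(np.asarray, (p0, p1, p2))
    def speed(s):
        # |B'(s)| for the quadratic bezier with control points p0, p1, p2
        return np.linalg.norm(2 * (1 - s) * (p1 - p0) + 2 * s * (p2 - p1))
    return quad(speed, 0.0, t)[0]

print(arc_length((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), 0.5))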

Resources