Calculate weighted mean point from array of 3D points - vector

I am writing the program in Cocoa but I think that the solution must be quite universal.
I have a set of points represented by 3D vectors. Each point has a weight assigned to it. The weight is in the range from 0 to 1. The sum of all weights isn't equal to 1.
How should the weighted mean point be calculated from such set?
Either a programmatic or a purely mathematical solution would be helpful.
Of course, if Cocoa has some specific tools for solving this task, I would very much appreciate that information.

Simply sum all vectors scaled by their weight. Finally, divide by the sum of all weights. This has the same effect as first normalizing all weights to sum to 1.
Pseudo-code:
sum = [0, 0, 0]
totalWeights = 0
for each point p with associated weight w:
    sum += p * w
    totalWeights += w
mean = sum / totalWeights
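For illustration, here is a minimal sketch of the same arithmetic in plain Python (not Cocoa-specific; the function name and sample data are made up):

def weighted_mean(points, weights):
    # points: list of (x, y, z) tuples; weights: matching list of floats in [0, 1]
    total_weight = sum(weights)
    sx = sum(w * p[0] for p, w in zip(points, weights))
    sy = sum(w * p[1] for p, w in zip(points, weights))
    sz = sum(w * p[2] for p, w in zip(points, weights))
    return (sx / total_weight, sy / total_weight, sz / total_weight)

print(weighted_mean([(0, 0, 0), (2, 2, 2)], [0.25, 0.75]))  # -> (1.5, 1.5, 1.5)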

Related

Coefficient of Euclidean Distance

I have been trying to calculate the correlation coefficient (say r) and the Euclidean distance (say d) between two random variables X and Y. It is known that -1 <= r <= 1, whereas d >= 0. To compare these similarity metrics (mostly for visualization purposes), I first want to compute a coefficient for d so that it lies between 0 and 1, or between -1 and 1, like r. One way to scale d is to divide it by its maximum, i.e. d* = d/max(d). However, the maximum max(d) is not a global value, and when someone uses different data points for X and Y it is no longer comparable to the first one. Therefore, I'm asking this community to suggest a better way of scaling the Euclidean distance so that it lies in [0, 1] or [-1, 1].
I appreciate your cooperation in advance.
Alemu
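For illustration, a rough sketch (in Python with NumPy, using made-up data) of the max-based scaling described in the question, which also shows why it is data-set dependent:

import numpy as np

# Hypothetical data: several pairs of random variables, 100 samples each
pairs = [(np.random.rand(100), np.random.rand(100)) for _ in range(10)]

r = [np.corrcoef(x, y)[0, 1] for x, y in pairs]           # each in [-1, 1]
d = np.array([np.linalg.norm(x - y) for x, y in pairs])   # each >= 0

# The scaling described above: d* = d / max(d). Note that max(d) depends on
# this particular collection of pairs, which is exactly why the scaled values
# are not comparable across different data sets.
d_scaled = d / d.max()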

How does distance weighting work in KNN?

I'm writing a KNN classifier in R. I want to add a weighting scheme, e.g. inverse distances, 1/d. As it is, for the Iris dataset I get an almost perfect 66% accuracy (no matter the metric used) since value no. 3 ("virginica") almost never shows up, and I want to make it better with weighting. My question is: what exactly do I weight, and how? I've read that I should weight the classes of the K nearest neighbours with those distances.
I've tried creating vectors of the classes of and distances to the K nearest neighbours, and then taking a weighted mean of them:
inverted <- function(vals, distances)
{
  inv_distances <- 1 / distances
  # eliminate division-by-zero errors
  inv_distances <- ifelse((inv_distances < 0.01), 0.01, inv_distances)
  weighted.mean(vals, inv_distances)
}
My results are weird: for correct vectors vals (classes) and distances I sometimes get NaN (Not a Number) or NA values. Also, my weights don't sum to 1, and... they probably should? I'm not sure. I just need someone to clarify this weighting scheme for me.
EDIT:
I've debugged the above code; it multiplied by the weights too late (therefore not eliminating distance 0 and causing NaNs). I've also changed it to harmonic-series weights that don't use the distances (so the first neighbour has weight 1, the second 1/2, the third 1/3, etc.). I still don't know exactly how it works and what other weighting schemes there may be.
inverted <- function(vals)
{
  weights <- 1 / seq(length(vals))
  res <- weighted.mean(vals, weights)
  res
}
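For reference, one common scheme is inverse-distance weighted voting: each of the K nearest neighbours votes for its own class with weight 1/d, and the class with the largest total weight wins, so the weights never need to sum to 1. A rough sketch in Python (not the asker's R code; the neighbour labels and distances below are made up):

from collections import defaultdict

def weighted_vote(classes, distances, eps=1e-9):
    # classes: labels of the K nearest neighbours
    # distances: their distances to the query point, in the same order
    votes = defaultdict(float)
    for cls, dist in zip(classes, distances):
        votes[cls] += 1.0 / (dist + eps)   # eps guards against zero distances
    return max(votes, key=votes.get)

print(weighted_vote(["setosa", "virginica", "virginica"], [0.1, 0.4, 0.5]))  # -> "setosa"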

Probability of selecting exactly n elements

I have a list of about 100 000 probabilities, one per event, stored in a vector.
I want to know if it is possible to calculate the probability of exactly n events occurring (e.g. what is the probability that exactly 1000 events occur).
I managed to calculate several probabilities in R :
p is the vector containing all the probabilities
probability of none : prod(1-p)
probability of at least one : 1 - prod(1-p)
I found how to calculate the probability of exactly one event :
sum(p * (prod(1-p) / (1-p)))
But I don't know how to generate a formula for n events.
I do not know R, but I know how I would solve this with programming.
This is a straightforward dynamic programming problem. We start with a vector v = [1.0] of probabilities. Then in untested Python:
v = [1.0]  # v[k] = probability that exactly k events have occurred so far
for p_i in probabilities:
    next_v = [(1 - p_i) * v[0]]   # zero events so far: this event must also fail
    v.append(0.0)                 # pad so v[j + 1] is defined below
    for j in range(len(v) - 1):
        next_v.append(v[j] * p_i + v[j + 1] * (1 - p_i))
    # Renormalize to compensate for round-off errors
    total = sum(next_v)
    for j in range(len(next_v)):
        next_v[j] /= total
    v = next_v
And now the answer can be read off directly: v[n] is the probability that exactly n events occur.
This approach is equivalent to calculating Pascal's triangle row by row, throwing away the old row when you're done.

Generate N random integers that are sampled from a uniform distribution and sum to M in R [duplicate]

In some code I want to choose n random numbers in [0,1) which sum to 1.
I do so by choosing the numbers independently in [0,1) and normalizing them by dividing each one by the total sum:
from random import random
numbers = [random() for _ in range(n)]
numbers = [x / sum(numbers) for x in numbers]
My "problem" is, that the distribution I get out is quite skew. Choosing a million numbers not a single one gets over 1/2. By some effort I've calculated the pdf, and it's not nice.
Here is the weird looking pdf I get for 5 variables:
Do you have an idea for a nice algorithm to choose the numbers, that result in a more uniform or simple distribution?
You are looking to partition the distance from 0 to 1.
Choose n - 1 numbers from 0 to 1, sort them, and take the gaps between consecutive values, including the endpoints 0 and 1; those n gaps are your numbers and they sum to 1.
This will partition the space 0 to 1, which should yield the occasional large result which you aren't getting.
Even so, for large values of n, you can generally expect your max value to decrease as well, just not as quickly as your method.
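A minimal sketch of this partition approach in Python (the function name is just illustrative):

import random

def partition_unit_interval(n):
    # n - 1 uniform cut points split [0, 1] into n pieces that sum to 1
    cuts = sorted(random.random() for _ in range(n - 1))
    bounds = [0.0] + cuts + [1.0]
    return [b - a for a, b in zip(bounds, bounds[1:])]

parts = partition_unit_interval(5)
print(parts, sum(parts))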
You might be interested in the Dirichlet distribution, which is used for generating quantities that sum to 1 if you're looking for probabilities. There's also a section on how to generate them using gamma distributions here.
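For example, with NumPy (assuming it is available), Dirichlet(1, ..., 1) gives vectors distributed uniformly over all non-negative n-vectors that sum to 1:

import numpy as np

parts = np.random.dirichlet(np.ones(5))   # equivalent to normalizing 5 Gamma(1) draws
print(parts, parts.sum())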
Another way to get n random numbers which sum up to 1:
import random

def create_norm_arr(n, remaining=1.0):
    random_numbers = []
    for _ in range(n - 1):
        r = random.random()  # get a random number in [0, 1)
        r = r * remaining
        remaining -= r
        random_numbers.append(r)
    random_numbers.append(remaining)
    return random_numbers

random_numbers = create_norm_arr(5)
print(random_numbers)
print(sum(random_numbers))
This makes higher numbers more likely.

How can I get a more precise value from a list of weighted numbers?

A is a list of increasing fixed values (frequencies). It doesn't step evenly but the values never change.
A = 5, 10, 17, 23, 30
Each value in A is weighted by the corresponding value in list B (volume).
B = 2.2, 3.5, 4.4, 3.2, 1.1
I want to calculate the loudest frequency (A). The problem is that the loudest frequency may be 14 but I can't tell from this data set. How can I calculate, based on list B, what the loudest frequency might be in list A?
Here's a rough outline of a solution: I haven't nutted out all the maths for you, but I hope it helps.
Approximate the frequency amplitude using interpolatory splines.
This will give you the function between each adjacent pair of frequency sample points as a sum of basis functions for the frequency values surrounding the pair.
This means you have a function f(x) defined on each interval.
f(x) = A phi_0(x) + B phi_1(x) + C phi_2(x) + D phi_3(x)
At the maximum
0 = f'(x) = A phi_0'(x) + B phi_1'(x) + C phi_2'(x) + D phi_3'(x)
If you're using cubic spline interpolation, the derivative will be quadratic in x, so you can obtain up to two potential extrema for each interval.
Scan through all the intervals and calculate those extrema. Check whether each one falls inside its interval; if it doesn't, it's not really a potential extremum. You now have a list of all the potential interior maxima. Add to this list the values at each node. The maximum from this list will be the maximum value of the interpolatory spline.
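A rough sketch of this approach using SciPy's CubicSpline (assuming SciPy is available; A and B are the values from the question):

import numpy as np
from scipy.interpolate import CubicSpline

A = np.array([5, 10, 17, 23, 30], dtype=float)   # frequencies
B = np.array([2.2, 3.5, 4.4, 3.2, 1.1])          # volumes

spline = CubicSpline(A, B)
# Candidate maxima: interior roots of the derivative, plus the sample points
candidates = spline.derivative().roots()
candidates = candidates[(candidates >= A[0]) & (candidates <= A[-1])]
candidates = np.concatenate([candidates, A])
best = candidates[np.argmax(spline(candidates))]
print(best, spline(best))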
You have not been terribly clear here (IMHO). I don't know what it means to "weight" values in A by B. Do we assume we are to treat B as a function of A? Therefore, we are looking for the frequency (A) such that B attains its maximum value, AS A FUNCTION OF A?
If so, this is simply a problem of interpolation, then maximizing the interpolant. Depending on what tools you have available to you, you might do a spline interpolation, as shown in this figure. Then it would be merely a problem of finding the maximum value of that spline.
This spline model suggests the maximum value is Bmax=4.4132, which occurs at A=16.341.
Alternatively, one might simply fit an interpolating polynomial through the points. Your data is not so noisy that a 4th-degree polynomial will be ill-posed. (Had you more points, a high-order polynomial would be a terrible idea. Then you might use a piecewise Lagrange interpolant.) Done in MATLAB,
>> P = polyfit(A,B,4)
P =
6.6992e-05 -0.0044803 0.084249 -0.34529 2.3384
I'll plot the polynomial itself.
>> ezplot(@(x) polyval(P,x),[5,30])
We can find the maximum value by looking for a root (zero value) of the derivative function. Since the derivative is a cubic polynomial, there are three roots. Only one of them is of interest.
>> roots(polyder(P))
ans =
31.489
16.133
2.5365
The root of interest is at 16.133, which is consistent with the prediction from the interpolating spline.
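For readers without MATLAB, a rough equivalent of the same steps in NumPy (a sketch, not the answerer's code; the data are from the question):

import numpy as np

A = np.array([5, 10, 17, 23, 30], dtype=float)
B = np.array([2.2, 3.5, 4.4, 3.2, 1.1])

P = np.polyfit(A, B, 4)             # 4th-degree polynomial fit
crit = np.roots(np.polyder(P))      # roots of the derivative (up to three)
crit = crit[np.isreal(crit)].real
crit = crit[(crit >= A.min()) & (crit <= A.max())]
best = crit[np.argmax(np.polyval(P, crit))]
print(best, np.polyval(P, best))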
