How to select top N values and sum them in mathematical optimization - math

I have an array of values (probabilities) as below,
a = [0, 0.1, 0.9, 0.2, 0, 0.8, 0.7, 0]
To select the maximum value from this array I can use the generalized mean equation. But what should I do if I want to select, say, the top N values and sum them?
e.g. summing top 2 values will give me 0.9 + 0.8 = 1.7
But I don't need an implementation/algorithm for it. I need a mathematical equation (e.g. the generalized mean for selecting the max value), because I want to optimize a function which includes this selection.
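For reference, the generalized (power) mean the question refers to recovers the maximum in the limit, for non-negative a_i:
M_p(a) = ((a_1^p + a_2^p + ... + a_n^p) / n)^(1/p), and M_p(a) -> max_i a_i as p -> infinity.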

An effective way to get the top K values is to use a K-select algorithm.
For example, a fast and reliable one is Quickselect (based on the same partition procedure as QuickSort), which has average complexity O(n) in the length of the array.
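For concreteness, a tiny R sketch of the select-and-sum step on the array from the question (a full sort is shown only for brevity; for large vectors a partial sort, e.g. sort(x, partial = ...), or a proper selection algorithm avoids the full O(n log n) sort):
a <- c(0, 0.1, 0.9, 0.2, 0, 0.8, 0.7, 0)
n_top <- 2
sum(sort(a, decreasing = TRUE)[1:n_top])   # 0.9 + 0.8 = 1.7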
The generalized mean as a route to the maximum seems like a purely mathematical construction, doesn't it?
So I doubt that a sum of arbitrarily large powers is applicable in real-life programming.

Related

Can we test if distance matrices are significantly farther apart?

I work in the field of linguistics, and my current project involves lots of distance matrices generated from language data, which measure the distance and similarity of dialects. Concretely, my distance matrices range between 0 and 1, where 0 represents no distance and 1 represents maximal distance between dialects. Now, I am wondering whether there exist statistical significance tests, or something like that, with which we can test if dialect A and dialect B are significantly farther apart than B and C. Alternatively, is there a customary threshold, say 0.5, whereby distances > 0.5 indicate that dialects are more different than similar? For instance, consider the distances between CMM_Press and CTM_Press on one hand, and CMM_Other and CTM_Other on the other hand in the distance matrices below.
In the four distance matrices above, I am especially interested in the distances of the following pairs:
CMM_Press and CTM_Press: 0.2, 0.19, 0.5, 0.4;
CMM_Other and CTM_Other: 0.6, 0.41, 0.4, 0.69.
Are there significance tests with which I can test whether, say, CMM_Other and CTM_Other are significantly farther apart than CMM_Press and CTM_Press?
To facilitate answering the questions, you can find the dataset, an R Markdown file containing the distance matrices, and the scripts for the analysis in this OSF link.
In addition, I would like to know whether there exists a good reference on how to interpret distance matrices (e.g., from an ecology point of view, where the Mantel test was invented).

Generating Integer Sequences based on a Modified Bernoulli Distribution

I want to use R to randomly generate an integer sequence where each integer is picked from a pool of integers (0, 1, 2, 3, ..., k) with replacement. k is pre-determined. The selection probability for an integer x in (0, 1, 2, 3, ..., k) is p^x * (1 - p), where p is pre-determined. That is, 1 has a much higher probability of being picked than k, and my final integer sequence will likely contain more 1s than ks. I am not sure how to implement this number-selecting process in R.
A generic approach to this type of problem would be:
Calculate p^x * (1 - p) for each integer x in 0..k.
Create a cumulative sum of these in a table t.
Draw a number from a uniform distribution between 0 and the largest value in t.
Measure how far into t that number falls and check which integer that corresponds to.
The larger the probability for an integer is, the larger part of that range it will cover.
Here's quick and dirty example code:
draw <- function(n = 1, k, p) {
  v  <- seq(0, k)                         # candidate integers 0..k
  pr <- p^v * (1 - p)                     # unnormalised weights p^x * (1 - p)
  t  <- cumsum(pr)                        # cumulative weights (unnormalised CDF)
  x  <- runif(n, min = 0, max = max(t))   # uniform draws over the total weight
  f  <- findInterval(x, vec = t)          # 0 means the draw fell in the first interval, i.e. integer 0
  v[f + 1]                                # x == max(t) exactly essentially never happens
}
Note that the proposed solution doesn't care whether your probabilities add up to 1. In real life they likely will, based on your description, but that's not really important for the solution.
The answer by Sirius is good. But as far as I can tell, what you're describing is something like a truncated geometric distribution.
I should note that the geometric distribution is defined differently in different works (see MathWorld, for example), so we use the distribution defined as follows:
P(X = x) ~ p^x * (1 - p), where x is an integer in [0, k].
I am not very familiar with R, but the solution involves calling rgeom(1, 1 - p) until the result is k or less.
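A minimal R sketch of that idea (the function name rtruncgeom is only for illustration; note that rgeom takes the success probability, so 1 - p here, which matches the definition above):
rtruncgeom <- function(n, k, p) {
  out <- integer(n)
  for (i in seq_len(n)) {
    repeat {
      x <- rgeom(1, prob = 1 - p)   # P(X = x) = (1 - p) * p^x, x = 0, 1, 2, ...
      if (x <= k) break             # keep only draws in 0..k (truncation)
    }
    out[i] <- x
  }
  out
}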
Alternatively, you can use a generic rejection sampler, since the probabilities are known (better called weights here, since they need not sum to 1). Rejection sampling is described as follows:
Assume each weight is 0 or greater. Store the weights in a list. Calculate the highest weight; call it max. Then, to choose an integer in the interval [0, k] using rejection sampling:
Choose a uniform random integer i in the interval [0, k].
With probability weights[i]/max (where weights[i] = p^i * (1-p) in your case), return i. Otherwise, go to step 1.
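In R, those two steps could look like the following sketch (the function name is only for illustration; weights[i + 1] holds the weight of integer i because R vectors are 1-based):
sample_by_rejection <- function(k, p) {
  weights <- p^(0:k) * (1 - p)              # weight of integer i is p^i * (1 - p)
  wmax <- max(weights)
  repeat {
    i <- sample(0:k, 1)                     # step 1: uniform integer in [0, k]
    if (runif(1) < weights[i + 1] / wmax)   # step 2: accept with probability weight/max
      return(i)
  }
}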
Given the weights for each item, there are many other ways to make a weighted choice besides rejection sampling or the solution in Sirius's answer; see my note on weighted choice algorithms.

How do I write a mathematical formula that non-linearly scales a continuous variable to a 0-100 span, where f(x) → 100 as x → Inf

I am building an index of user qualities, defined as a sum of (often correlated) continuous variables representing user activity. The index is well calibrated and serves the purpose of my analysis, but it is tricky to communicate to my co-workers, particularly since outlier activities cause extremely tenacious users to score very highly on the activity index.
For 97% of users, the index is distributed near-normally between 0 and 100, with a right tail of 3% of hyper-active users with an index > 100. Index values beyond 200 should be extremely rare but are theoretically possible.
I'm looking to scale the tail back into the 0-100 span, but not linearly, since I would like the 3% tail to be represented as small variances within the top range of the 0-100 index. What I'm looking for is a non-linear formula to scale my index, like this:
so that the lower tier of the unscaled index remains close to the scaled one, high index values diverge, and the scaled values never reach 100 as my index goes towards infinity; i.e., f(0) = 0, but when x = 140, f(x) ≈ 99 or something similar.
I'll implement the scaling in R, Python and BigQuery.
There are lots of ways to do this: take any function with the right shape and tweak it to your needs.
One family of functions with the right shape is
f(x) = x / (1 + (x/100)^n)^(1/n)
You can vary the parameter n to adjust the shape: increasing n pushes f(100) closer to 100. With n=5 you get something that looks pretty close to your drawing
f(x) = x / (1 + (x/100)^5)^0.2
Another option is taking the hyperbolic tangent function tanh which you can of course tweak in similar ways:
f(x) = 100 * tanh((x/100)^n)^(1/n)
here's the curve with n=2:
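Since R is one of the target languages, here is a rough sketch of both families (the function names are mine; the constant 100 is the soft ceiling from the question and n is the shape parameter to tune):
squash_ratio <- function(x, n = 5) x / (1 + (x / 100)^n)^(1 / n)    # approaches 100 as x -> Inf
squash_tanh  <- function(x, n = 2) 100 * tanh((x / 100)^n)^(1 / n)  # approaches 100 as x -> Inf
squash_ratio(c(0, 50, 100, 140, 200))   # stays close to x at the low end, flattens out below 100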

How can I get a more precise value from a list of weighted numbers?

A is a list of increasing fixed values (frequencies). It doesn't step evenly but the values never change.
A = 5, 10, 17, 23, 30
Each value in A is weighted by the corresponding value in list B (volume).
B = 2.2, 3.5, 4.4, 3.2, 1.1
I want to calculate the loudest frequency (A). The problem is that the loudest frequency may be 14 but I can't tell from this data set. How can I calculate, based on list B, what the loudest frequency might be in list A?
Here's a rough outline of a solution: I haven't nutted out all the maths for you, but I hope it helps.
Approximate the frequency amplitude using interpolatory splines.
This will give you the function between each adjacent pair of frequency sample points as a sum of basis functions for the frequency values surrounding the pair.
This means you have a function f(x) defined on each interval.
f(x) = A phi_0(x) + B phi_1(x) + C phi_2(x) + D phi_3(x)
At the maximum
0 = f'(x) = A phi_0'(x) + B phi_1'(x) + C phi_2'(x) + D phi_3'(x)
If you're using cubic spline interpolation, the derivative will be quadratic in x, and thus you can obtain two candidate extrema for each interval.
Scan through all the intervals and calculate those extrema. Check whether each one falls inside its interval; if it doesn't, it's not really a potential extremum. You now have a list of all the potential internal maxima. Add to this list the values at each node. The maximum of this list will be the maximum value of the interpolatory spline.
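If it helps, here is a small R sketch of that outline: splinefun builds the interpolatory cubic spline and optimize searches for the peak numerically rather than solving the quadratic on each interval, but the idea is the same.
A <- c(5, 10, 17, 23, 30)        # frequencies
B <- c(2.2, 3.5, 4.4, 3.2, 1.1)  # volumes
f <- splinefun(A, B)             # cubic spline interpolant of volume as a function of frequency
peak <- optimize(f, interval = range(A), maximum = TRUE)
peak$maximum    # estimated loudest frequency
peak$objective  # interpolated volume at that frequency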
You have not been terribly clear here (IMHO). I don't know what it means to "weight" values in A by B. Do we assume we are to treat B as a function of A? Therefore, we are looking for the frequency (A) such that B attains its maximum value, AS A FUNCTION OF A?
If so, this is simply a problem of interpolation, then maximizing the interpolant. Depending on what tools you have available to you, you might do a spline interpolation, as shown in this figure. Then it would be merely a problem of finding the maximum value of that spline.
This spline model suggests the maximum value is Bmax=4.4132, which occurs at A=16.341.
Alternatively, one might simply fit an interpolating polynomial through the points. Your data are not so noisy that a 4th-degree polynomial will be ill-posed. (Had you more points, a high-order polynomial would be a terrible idea; then you might use a piecewise Lagrange interpolant.) Done in MATLAB,
>> P = polyfit(A,B,4)
P =
6.6992e-05 -0.0044803 0.084249 -0.34529 2.3384
I'll plot the polynomial itself.
>> ezplot(@(x) polyval(P,x),[5,30])
We can find the maximum value by looking for a root (zero value) of the derivative function. Since the derivative is a cubic polynomial, there are three roots. Only one of them is of interest.
>> roots(polyder(P))
ans =
31.489
16.133
2.5365
The root of interest is at 16.133, which is consistent with the prediction from the interpolating spline.

Combining two normal random variables

Suppose I have the following two random variables:
X where mean = 6 and stdev = 3.5
Y where mean = -42 and stdev = 5
I would like to create a new random variable Z based on the first two, knowing that X happens 90% of the time and Y happens 10% of the time.
It is easy to calculate the mean of Z: 0.9 * 6 + 0.1 * (-42) = 1.2.
But is it possible to generate random values for Z in a single function?
Of course, I could do something along those lines :
if (randIntBetween(1,10) > 1)
GenerateRandomNormalValue(6, 3.5);
else
GenerateRandomNormalValue(-42, 5);
But I would really like to have a single function that would act as a probability density function for such a random variable (Z), which is not necessarily normal.
sorry for the crappy pseudo-code
Thanks for your help!
Edit: here is one concrete question:
Let's say we add up 5 consecutive values from Z. What would be the probability of ending up with a number higher than 10?
"But I would really like to have a single function that would act as a probability density function for such a random variable (Z), which is not necessarily normal."
Okay, if you want the density, here it is:
rho = 0.9 * density_of_x + 0.1 * density_of_y
But you cannot sample from this density unless you 1) compute its CDF (cumbersome, but not infeasible) and 2) invert it (you will need a numerical solver for this). Or you can do rejection sampling (or variants, e.g. importance sampling). This is costly, and cumbersome to get right.
So you should go for the "if" statement (i.e. call the generator 3 times), unless you have a very strong reason not to (using quasi-random sequences, for instance).
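In R terms (the helper names dZ and rZ are mine), the mixture density and the branch-based sampler would look something like this:
dZ <- function(z) 0.9 * dnorm(z, mean = 6, sd = 3.5) + 0.1 * dnorm(z, mean = -42, sd = 5)  # density of Z
rZ <- function(n) ifelse(runif(n) < 0.9,   # branch with probability 0.9 / 0.1
                         rnorm(n, 6, 3.5), # ifelse evaluates both branches and
                         rnorm(n, -42, 5)) # picks elementwise; fine for a sketch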
If a random variable is denoted x=(mean,stdev) then the following algebra applies
number * x = ( number*mean, number*stdev )
x1 + x2 = ( mean1+mean2, sqrt(stdev1^2+stdev2^2) )
so for the case of X = (mx,sx), Y= (my,sy) the linear combination is
Z = w1*X + w2*Y = (w1*mx,w1*sx) + (w2*my,w2*sy) =
( w1*mx+w2*my, sqrt( (w1*sx)^2+(w2*sy)^2 ) ) =
( 1.2, 3.19 )
link: Normal Distribution; look for the Miscellaneous section, item 1.
PS. Sorry for the weird notation. The new standard deviation is calculated by something similar to the Pythagorean theorem: it is the square root of the sum of squares.
This is the form of the distribution:
ListPlot[BinCounts[Table[If[RandomReal[] < .9,
RandomReal[NormalDistribution[6, 3.5]],
RandomReal[NormalDistribution[-42, 5]]], {1000000}], {-60, 20, .1}],
PlotRange -> Full, DataRange -> {-60, 20}]
It is NOT Normal, as you are not adding Normal variables, but just choosing one or the other with certain probability.
Edit
This is the curve for adding five vars with this distribution:
The upper and lower peaks represent taking one of the distributions alone, and the middle peak accounts for the mixing.
The most straightforward and generically applicable solution is to simulate the problem:
Run the piecewise function you have a large number of times (say 1,000,000), split the results into bins to build a histogram, and divide the count in each bin by N (1,000,000 in my example) and by the bin width. This will leave you with an approximation of the PDF of Z at every given bin.
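As a hedged sketch, this is what that simulation could look like in R, applied to the asker's follow-up about the sum of five consecutive draws (the 0.9/0.1 weights and the normal parameters are taken from the question):
set.seed(1)
N <- 1e6
draw_z <- function(n) ifelse(runif(n) < 0.9, rnorm(n, 6, 3.5), rnorm(n, -42, 5))
sums <- colSums(matrix(draw_z(5 * N), nrow = 5))  # N sums of five consecutive draws from Z
mean(sums > 10)                                   # Monte Carlo estimate of P(sum > 10)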
Lots of unknowns here, but essentially you just wish to add the two (or more) probability density functions together, weighted by how often each occurs.
For any given probability function you could calculate a random number with that density by calculating the area under the probability curve (the integral) and then generating a random number between 0 and that area. Then move along the curve until the accumulated area equals your random number, and use that point as your value.
This process can then be generalized to any function (or sum of two or more functions).
Elaboration:
Suppose you have a distribution function f(x) whose variable x ranges from 0 to 1. You could generate a random number based on the distribution by calculating the integral of f(x) from 0 to 1, giving you the area under the curve; let's call it A.
Now, generate a random number between 0 and A; call that number r. You then need to find a value t such that the integral of f(x) from 0 to t is equal to r. That t is your random number.
This process can be used for any probability density function f(x), including the sum of two (or more) probability density functions.
I'm not sure what your functions look like, so I'm not sure whether you can find analytic solutions for all of this, but in the worst case you could use numerical techniques to approximate the result.
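As a worked example of that integrate-and-invert recipe on this thread's mixture (the density, the bracketing interval, and the helper names are assumptions of mine; it is much slower than simply branching between the two normal generators):
dens <- function(x) 0.9 * dnorm(x, 6, 3.5) + 0.1 * dnorm(x, -42, 5)
cdf  <- function(t) integrate(dens, lower = -Inf, upper = t)$value   # area under the curve up to t
draw_one <- function() {
  r <- runif(1)                      # target area between 0 and 1 (total area A = 1 here)
  uniroot(function(t) cdf(t) - r,    # find t with CDF(t) = r
          interval = c(-100, 60))$root
}
z <- replicate(1000, draw_one())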
