Someone just asked why sum(myfloats) differed from sum(reversed(myfloats)). Quickly got duped to Is floating point math broken? and deleted.
But it made me curious: How many different sums can we get from very few floats, just by summing them in different orders? With three floats, we can get three different sums:
>>> from itertools import permutations
>>> for perm in permutations([0.2, 0.3, 0.4]):
...     print(perm, sum(perm))
(0.2, 0.3, 0.4) 0.9
(0.2, 0.4, 0.3) 0.9000000000000001
(0.3, 0.2, 0.4) 0.9
(0.3, 0.4, 0.2) 0.8999999999999999
(0.4, 0.2, 0.3) 0.9000000000000001
(0.4, 0.3, 0.2) 0.8999999999999999
I believe addition is commutative (i.e., a + b == b + a) for floats. And we have three choices for the first pair to add and then one "choice" for the second add, so three sums is the most we can get with just three values.
Can we get more than three different sums with four values? With some experiments I didn't find such a case. If we can't: why not? If we can: how many? How many with five?
As Eric just pointed out, for more than three values there are also different possibilities than just summing left to right, for example (a+b) + (c+d). I'm interested in any way to add the numbers.
Note I'm talking about 64-bit floats (I'm a Python guy, I know in other languages they're often called doubles).
You most certainly can. For your specific question:
Can we get more than three different sums with four values?
Here's one set of values to show this is indeed the case:
v0 = -1.5426605224883486e63
v1 = 7.199082438280276e62
v2 = 8.227522786603223e62
v3 = -1.4272476927059597e45
print (v0 + v2 + v1 + v3)
print (v3 + v1 + v0 + v2)
print (v2 + v1 + v0 + v3)
print (v1 + v2 + v3 + v0)
When I run this, I get:
1.36873053731e+48
1.370157785e+48
1.46007438964e+48
1.46150163733e+48
all of which are different.
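A brute-force check along these lines can count how many distinct results the left-to-right orders produce. This is only a sketch: `distinct_sums` is a helper name introduced here for illustration, and it covers plain left-to-right summation only, not other bracketings.

```python
from itertools import permutations

def distinct_sums(values):
    """Return the set of results of adding `values` left to right in every order."""
    results = set()
    for perm in permutations(values):
        total = perm[0]
        for x in perm[1:]:
            total += x  # one rounding step per addition
        results.add(total)
    return results

three = [0.2, 0.3, 0.4]
four = [-1.5426605224883486e63, 7.199082438280276e62,
        8.227522786603223e62, -1.4272476927059597e45]

print(len(distinct_sums(three)))
print(len(distinct_sums(four)))
```

The loop adds the values explicitly rather than calling sum(), because CPython 3.12 and later reportedly use a compensated summation for floats inside sum(), which would mask the effect being studied.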
With 5, here's an example set:
v0 = -8.016918059381093e-292
v1 = -0.0
v2 = 2.4463434328110855e-296
v3 = 8.016673425037811e-292
v4 = 1.73833895195875e-310
print(v4 + v1 + v0 + v2 + v3)
print(v2 + v3 + v0 + v1 + v4)
print(v4 + v3 + v1 + v0 + v2)
print(v1 + v0 + v2 + v3 + v4)
print(v1 + v4 + v2 + v0 + v3)
This prints:
-8.90029543403e-308
1.73833895196e-310
-4.45041933248e-308
-8.88291204451e-308
0.0
again, with all different outcomes.
I wouldn't be surprised if, for any sufficiently large n (perhaps n > 2 is enough), you could find n values such that all permutations produce different sums, modulo those that are equivalent due to commutativity. (Though of course this is a conjecture.) With catastrophic cancellation, you can arrange for the outcomes to differ by a large amount.
If you have N floats, there are N! possible orders in which to add them. Assuming you always do one addition at a time, this leads to 0.5*[N*(N-1)]*(N-2)! = N!/2 versions of the calculation (assuming a+b = b+a for the first pair, i.e. a+b has the same rounding as b+a).
After each step there is a rounding error.
As long as you don't force the order of the calculation, I would expect up to N!/2 possible results due to rounding errors.
The differences, however, should be in the least significant digits as long as you work with purely positive numbers.
If you do things like subtracting very similar numbers (example: 9999999.9 + (-9999999.8)), the rounding errors can become critical (depending on the size of your floats).
Using brackets you can get even more combinations. Example: (a+b)+(c+d) may differ from ((a+b)+c)+d, because you force a rounding with each pair of brackets.
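To explore bracketings as well as orderings, one option is to enumerate every way of splitting the values into two groups, recursively. The sketch below assumes float addition is commutative (so each unordered split is visited once); `all_sums` is a name made up for this example, and the approach is exponential, so it is only practical for a handful of values.

```python
from functools import lru_cache
from itertools import combinations

def all_sums(values):
    """Set of results reachable by adding `values` in any order and any bracketing."""
    values = tuple(values)

    @lru_cache(maxsize=None)
    def sums(idx):
        # idx is a tuple of positions into `values`
        if len(idx) == 1:
            return frozenset([values[idx[0]]])
        first, rest = idx[0], idx[1:]
        results = set()
        # Keep idx[0] in the left group so that each unordered split is seen once.
        for k in range(len(rest)):  # the right group must stay non-empty
            for extra in combinations(rest, k):
                left = (first,) + extra
                right = tuple(i for i in rest if i not in extra)
                for a in sums(left):
                    for b in sums(right):
                        results.add(a + b)
        return frozenset(results)

    return set(sums(tuple(range(len(values)))))

print(sorted(all_sums([0.2, 0.3, 0.4])))
```

For three values this visits a+(b+c), (a+b)+c and (a+c)+b, which are the three groupings discussed in the question.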
I'm trying to figure out how to calculate the cosine similarity of these two vectors:
A:(1,1,0,0,0,0,0,0,0)
B:(1,0,0,1,0,0,0,0,1)
From what I understand, I need to multiply A by B and then divide it by the length of A*B.
The first part I understand, but how do I know what the length is?
A is a document with 11 words
B is a query with 7 words
Does the length refer to the number of words? Or do I have to 'normalize' the vectors? I'm unsure because, from what I understood, the cosine already normalizes the vector.
Any help or hints would be appreciated.
Vector operations are defined on elements of a vector space (the dimensionality of which needs to be predefined).
So, even though we loosely use the terminology sparse vector, it's nothing but an efficient way to represent a dense vector...
For your example involving text vectors - both a document and a query are nothing but dense vectors (the dimensionality is the size of the entire vocabulary).
The first step is to convert a document and a query to a dense vector representation.
If D=<the cat sat on the mat> and Q=<cat on mat>,
then the vocabulary (set of unique words) is {cat, mat, on, sat, the}
Our toy vector space is hence of dimensionality 5. Each vector is hence represented by 5 numbers, where the value represents presence/absence or count of the corresponding term.
vec(D) = (1, 1, 1, 1, 2) - because cat occurs once, the occurs twice in D and so on.
Similarly, vec(Q) = (1, 1, 1, 0, 0) - note the 0's corresponding to the terms that are not present, e.g. the term the.
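As a small illustration of this step, the vectors can be built from token lists with a counter. This is a sketch under simple assumptions (whitespace tokenisation, alphabetically sorted vocabulary); `count_vectors` is a hypothetical helper name.

```python
from collections import Counter

def count_vectors(doc_tokens, query_tokens):
    """Return (vocabulary, document vector, query vector) using raw term counts."""
    vocab = sorted(set(doc_tokens) | set(query_tokens))
    doc_counts = Counter(doc_tokens)
    query_counts = Counter(query_tokens)
    # A Counter returns 0 for terms it has never seen, which gives the zero entries.
    return (vocab,
            [doc_counts[w] for w in vocab],
            [query_counts[w] for w in vocab])

vocab, vec_d, vec_q = count_vectors("the cat sat on the mat".split(),
                                    "cat on mat".split())
print(vocab)   # ['cat', 'mat', 'on', 'sat', 'the']
print(vec_d)   # [1, 1, 1, 1, 2]
print(vec_q)   # [1, 1, 1, 0, 0]
```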
Cosine similarity is the normalized inner product.
The numerator is simply \sum_i a_i * b_i, which for this example is
1*1 + 1*1 + 1*1 + 1*0 + 2*0 = 3
What about the lengths? For finding the length (specifically L2 norm) of a vector simply compute its inner product with itself and then take a square root.
Len(D) = sqrt(1*1 + 1*1 + 1*1 + 1*1 + 2*2) = 2sqrt(2)
Len(Q) = sqrt(1*1 + 1*1 + 1*1 + 0*0 + 0*0) = sqrt(3)
Hence cosine-sim = 3/(2sqrt(2)*sqrt(3))
Finally, for your example,
A:(1,1,0,0,0,0,0,0,0) and B:(1,0,0,1,0,0,0,0,1),
A.B = 1
Len(A) = sqrt(A.A) = sqrt(2)
Len(B) = sqrt(B.B) = sqrt(3)
cosine-sim = 1/sqrt(6) = 0.4082
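The same computation can be written as a short function. This is a minimal sketch using only the standard library; `cosine_similarity` is a name chosen here, and it assumes both vectors have the same dimensionality and neither is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Inner product of a and b divided by the product of their L2 norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

A = (1, 1, 0, 0, 0, 0, 0, 0, 0)
B = (1, 0, 0, 1, 0, 0, 0, 0, 1)
print(round(cosine_similarity(A, B), 4))  # 0.4082
```

Note that the number of words in the document or query never enters the calculation; only the vector entries do.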
I am sure everyone is familiar with the famous 21 matchsticks game where each person picks up 1,2 or 3 matches and the last person to pick up a match loses.
Let's simplify the game and assume that it is only possible to pick 1 or 2 matches. My question is, how many games are possible?
I know this is very easy to solve recursively, however, I am trying to come up with a combinatorial solution.
To provide an example, let's reduce 21 to just 4 matches. The number of possible games would be 5. {'MCM', 'MMMM', 'CC', 'CMM', 'MMC'}. Where C represents removing 2 matches and M represents removing a single match.
The symbolic method allows us to deduce that the generating function for this combinatorial class is
f(z) = 1/(1 - z - z^2 - z^3)
At this point, we can obtain the answer through a power series expansion, e.g. see here. The coefficient on z^21 will give the number of possible games in "21 matchsticks" (it should be 223317).
Looking back, suppose that players were allowed to take one match only. Then, there would be only one possible scenario. For each game length (power of z), there is only one game outcome:
1/(1 - z) = 1*1 + 1*z + 1*z^2 + 1*z^3 + 1*z^4 + 1*z^5 + ...
If players are allowed to take one or two matches, we have multiple scenarios:
1/(1 - z - z^2) = 1*1 + 1*z + 2*z^2 + 3*z^3 + 5*z^4 + 8*z^5 + ...
The coefficients recover the Fibonacci sequence and can be interpreted as the number of integer compositions of n using only the numbers 1 and 2.
Allowing for taking one, two or three matches leads to the following expansion,
1/(1 - z - z^2 - z^3) = 1*1 + 1*z + 2*z^2 + 4*z^3 + 7*z^4 + 13*z^5 + ...
which can be found in this OEIS sequence, cordially named the "Tribonacci numbers".
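The series coefficients can also be computed directly, since 1/(1 - z - z^2 - ...) corresponds to the recurrence c(n) = c(n-1) + c(n-2) + ... with c(0) = 1. The sketch below does exactly that; `count_games` and its `moves` parameter are names introduced here for illustration, and it counts sequences of moves only, ignoring who wins.

```python
def count_games(n, moves=(1, 2, 3)):
    """Coefficient of z^n in 1 / (1 - sum(z^m for m in moves))."""
    coeffs = [1] + [0] * n  # coeffs[0] = 1: the empty game
    for k in range(1, n + 1):
        # The last move removed m matches, leaving a game on k - m matches before it.
        coeffs[k] = sum(coeffs[k - m] for m in moves if k - m >= 0)
    return coeffs[n]

print([count_games(k) for k in range(6)])  # [1, 1, 2, 4, 7, 13]
print(count_games(4, moves=(1, 2)))        # 5, as in the four-match example
print(count_games(21, moves=(1, 2)))
print(count_games(21))
```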
It is possible to arrive at the 223317 answer using pen, paper and a shifted generalization of the Pascal triangle, although I would leave that task to someone else.
As an aside, I highly recommend the book "Analytic Combinatorics" by Philippe Flajolet and Robert Sedgewick for their introduction to the symbolic method and beyond.
I'm writing a Python script to generate problems for mental arithmetic drills. The addition and multiplication ones were easy, but I'm running into trouble trying to generate unbiased problems for the subtraction ones.
I want to be able to specify a minimum and maximum value that the minuend (first number) will be -- e.g., for two-digit subtraction it should be between 20 and 99. The subtrahend should also have a range option (11-99, say). The answer needs to be positive and preferably also bounded by a minimum of, say, 10 for this situation.
So:
20 < Minuend < 99
11 < Subtrahend < 99
Answer = Minuend - Subtrahend
Answer >= 10
All the numeric values should be used as variables, of course.
I have these conditions met as follows:
from random import randint

ansMin, ansMax = 10, 99
subtrahendMin, minuendMax = 11, 99
# the other max and min did not seem to be necessary here,
# and two ranges was the way I had the program set up
answer = randint(ansMin, ansMax)
subtrahend = randint(subtrahendMin, minuendMax - answer)
minuend = answer + subtrahend # rearranged subtraction equation
The problem here is that the minuend values wind up being nearly all over 50 because the answer and subtrahend were generated first and added together, and only the section of them that were both in the bottom 25% of the range will get the result below 50%. (Edit: that's not strictly true -- for instance, bottom 1% plus bottom 49% would work, and percentages are a bad way of describing it anyway, but I think the idea is clear.)
I also considered trying generating the minuend and subtrahend values both entirely randomly, then throwing out the answer if it didn't match the criteria (namely, that the minuend be greater than the subtrahend by a value at least greater than the answerMin and that they both be within the criteria listed above), but I figured that would result in a similar bias.
I don't care about it being perfectly even, but this is too far off. I'd like the minuend values to be fully random across the allowable range, and the subtrahend values random across the range allowed by the minuends (if I'm thinking about it right, this will be biased in favor of lower ones). I don't think I really care about the distribution of the answers (as long as it's not ridiculously biased). Is there a better way to calculate this?
There are several ways of defining what "not biased" means in this case. I assume that what you are looking for is that every possible subtraction problem from the allowed problem space is chosen with equal probability. Quick and dirty approach:
Pick random x in [x_min, x_max]
Pick random y in [y_min, y_max]
If x - y < answer_min, discard both x and y and start over.
Note that both x and y are discarded on a rejection. If you discard only y and keep x, your problems will have a uniform distribution in x, not over the entire problem space. You need to ensure that for every valid x there is at least one valid y - this is not the case for your original choice of ranges, as we'll see later.
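The quick and dirty approach can be sketched as a small rejection-sampling loop. The function and parameter names (`random_problem`, `y_max`, `rng`) are made up for this example; the guard at the top is an added assumption to avoid looping forever when no valid pair exists.

```python
import random

def random_problem(x_min, x_max, y_min, y_max, answer_min, rng=random):
    """Draw (minuend, subtrahend) uniformly from all pairs with x - y >= answer_min."""
    if x_max - y_min < answer_min:
        raise ValueError("no valid problem exists for these ranges")
    while True:
        x = rng.randint(x_min, x_max)
        y = rng.randint(y_min, y_max)
        if x - y >= answer_min:
            return x, y
        # Otherwise discard BOTH values and draw a fresh pair.

for _ in range(5):
    x, y = random_problem(21, 99, 11, 99, 10)
    print("%d - %d = ?" % (x, y))
```

Because every accepted pair was drawn with the same probability, the result is uniform over the valid problems, at the cost of some wasted draws.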
Now the long, proper approach. First we need to find out the actual size of the problem space.
The allowed set of subtrahends is determined by the minuend:
x in [21, 99]
y in [11, x-10]
or using symbolic constants:
x in [x_min, x_max]
y in [y_min, x - answer_min]
We can rewrite that as
x in [21, 99]
y = 11 + a
a in [0, x-21]
or again using symbolic constants
x in [x_min, x_max]
y = y_min + a
a in [0, x - (answer_min + y_min)].
From this, we see that valid problems exist only for x >= (answer_min + y_min), and for a given x there are x - (answer_min + y_min) + 1 possible subtrahends.
Now we assume that the range limits do not impose any further constraints, i.e. that x_min = answer_min + y_min (here 21 = 10 + 11), so that the smallest minuend admits exactly one subtrahend:
x in [21, 99], number of problems:
(99 - 21 + 1) * (1 + 78+1) / 2 = 3160
x in [x_min, x_max], number of problems:
(x_max - x_min + 1) * (1 + x_max - (answer_min + y_min) + 1) / 2
The above is obtained using the formula for the sum of an arithmetic sequence. Therefore, you need to pick a random number in the range [1, 3160]. To transform this number into a subtraction problem, we need to define a mapping between the problem space and the integers. An example mapping is as follows:
1 <=> x = 21, y = 11
2 <=> x = 22, y = 12
3 <=> x = 22, y = 11
4 <=> x = 23, y = 13
5 <=> x = 23, y = 12
6 <=> x = 23, y = 11
and so on. Notice that x jumps by 1 when a triangular number is exceeded. To compute x and y from the random number r, find the lowest triangular number t greater than or equal to r, preferably by searching in a precomputed table; write this number as q*(q+1)/2. Then x = x_min + q-1 and y = y_min + t - r.
Complete program:
import random

x_min, x_max = (21, 99)
y_min = 11
answer_min = 10

# (triangular number, q) pairs; integer division keeps everything an int
triangles = [(q * (q + 1) // 2, q) for q in range(1, x_max - x_min + 2)]
# total number of valid problems
upper = (x_max - x_min + 1) * (1 + x_max - (answer_min + y_min) + 1) // 2

for i in range(0, 20):
    r = 1 + random.randrange(0, upper)
    # lowest triangular number t >= r, where t = q*(q+1)/2
    (t, q) = next(a for a in triangles if a[0] >= r)
    x = x_min + q - 1
    y = y_min + t - r
    print("%d - %d = ?" % (x, y))
Note that for a majority of problems (around 75%), x will be above 60. This is correct, because for low values of the minuend there are fewer allowed values of the subtrahend.
I can see a couple of issues with your starting values - if you want the answer to always be greater than 10 - then you need to either increase MinuendMin, or decrease SubtrahendMin because 20-11 is less than 10... Also you have defined the answer min and max as 3,9 - which means the answer will never be more than 10...
Apart from that I managed to get a nice even distribution of values by selecting the minuend value first, then selecting the subtrahend value based on it and the answerMin:
from random import randint

ansMin = 10
minuendMin, minuendMax = 20, 99
subtrahendMin = 9
minuend = randint(minuendMin, minuendMax)
subtrahend = randint(subtrahendMin, minuend - ansMin)
answer = minuend - subtrahend
You say you've already got addition working properly. Assuming you have similar restrictions for the addends/sum you could rearrange the factors so that:
minuend <= sum
subtrahend <= first addend
answer <= second addend
A similar mapping can be made for multiplication/division, if required.
I have a value, for example 2.8. I want to find 10 numbers which are on an exponential curve, which sum to this value.
That is, I want to end up with 10 numbers which sum to 2.8, and which, when plotted, look like the curve below (exponential decay). These 10 numbers should be equally spaced along the curve - that is, the 'x-step' between the values should be constant.
This value of 2.8 will be entered by the user, and therefore the way I calculate this needs to be some kind of algorithm that I can program (hence asking this on SO not Math.SE).
I have no idea where to start with this at all - any ideas?
You want to have 10 x values equally distributed, i.e. x_k = a + k * b. They shall fulfill sum(exp(-x_k)) = v with v being your target value (the 2.8). This means exp(-a) * sum(exp(-b)^k) = v.
Obviously, there is a solution for each choice of b if v is positive. Set b to an arbitrary value, and calculate a from it.
E.g. for v = 2.8 and b = 0.1, you get a = -log(v / sum(exp(-b)^k)) = -log(2.8/sum(0.90484^k)) = -log(2.8/6.6425) = -log(0.421526) = 0.86387.
So for this example, the x values would be 0.86387, 0.96387, ..., 1.76387 and the y values 0.421526, 0.381412, 0.345116, 0.312274, 0.282557, 0.255668, 0.231338, 0.209324, 0.189404, 0.171380.
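The calculation above can be reproduced with a few lines of code. This is a sketch assuming the curve is exactly exp(-x) and v is positive; `exp_points` is a hypothetical helper name.

```python
import math

def exp_points(v, b, n=10):
    """Return (xs, ys) with xs equally spaced by b, ys = exp(-xs), and sum(ys) == v."""
    if v <= 0:
        raise ValueError("v must be positive")
    ratio_sum = sum(math.exp(-b * k) for k in range(n))  # sum of exp(-b)^k
    a = -math.log(v / ratio_sum)                         # starting x value
    xs = [a + k * b for k in range(n)]
    ys = [math.exp(-x) for x in xs]
    return xs, ys

xs, ys = exp_points(2.8, 0.1)
print(round(xs[0], 5), round(ys[0], 6), round(sum(ys), 6))
```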
Update:
As it has been clarified that the curve can be scaled arbitrarily and the xs are preferred to be 1, 2, 3 ... 9, this is much simpler.
Assuming the curve function is r*exp(-x), the values would be r*exp(-1) ... r*exp(-9). Their sum is r*sum(exp(-x)) = r*0.58190489. So to reach a certain value (2.8) you just have to adjust r accordingly:
r = 2.8/sum(exp(-x)) = 4.81178294
And you get the 9 values: 1.770156, 0.651204, 0.239565, 0.088131, 0.032422, 0.011927, 0.004388, 0.001614, 0.000594. (For 10 values, extend x to 10 and recompute the sum and r in the same way.)
If I understand your question correctly then you want to find x which solves the equation
a + a*x + a*x^2 + ... + a*x^9 = 2.8
It can be solved as
x = RootOf(a*x^10 - 2.8*x + 2.8 - a)
(just sum the numbers as a geometric progression)
The equation under RootOf will always have one real root different from 1 for 2.8 or any other positive number. You can solve it using some root-finding algorithm (1 is always a root but it does not solve the original task). For the constant a you can choose any number you like.
After computing x you can easily calculate the 10 numbers as a*x^k for k = 0 ... 9.
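As an illustration of the root-finding step, here is a bisection sketch. It assumes the equation being solved is a*(1 + x + ... + x^(n-1)) = v, and it restricts itself to a decaying ratio 0 < x < 1, which requires a < v < n*a; `solve_ratio` and its parameters are names introduced for this example.

```python
def solve_ratio(v, a, n=10, tol=1e-12):
    """Find x in (0, 1) with a * (1 + x + ... + x**(n-1)) == v by bisection."""
    if not (a < v < n * a):
        raise ValueError("need a < v < n*a for a decaying ratio 0 < x < 1")

    def g(x):
        return a * sum(x ** k for k in range(n)) - v

    lo, hi = 0.0, 1.0  # g(lo) = a - v < 0 and g(hi) = n*a - v > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

a = 1.0  # arbitrary choice of the first value
x = solve_ratio(2.8, a)
numbers = [a * x ** k for k in range(10)]
print(x, sum(numbers))
```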
I'm going to generalize and assume you want N numbers summing to V.
Since your numbers are equally spaced on an exponential you can write your sum as
a + a*x + a*x^2 + ... + a*x^(N-1) = V
Where the first point has value a, and the second a*x etc.
You can take out a factor of a and get:
a ( 1 + x + x^2 + ... + x^(N-1) ) = V
If we're free to pick x then we can solve for a easily
a = V / ( 1 + x + x^2 + .. x^(N-1) )
= V*(x-1)/(x^N-1)   (geometric series; valid for x != 1)
Substituting that back into
a, a*x, a*x^2, ..., a*x^(N-1)
gives the required sequence
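A short sketch of this recipe, using the standard geometric-series identity 1 + x + ... + x^(N-1) = (x^N - 1)/(x - 1) for x != 1. The name `geometric_terms` is made up here, and the x == 1 branch (all terms equal) is an added assumption to avoid dividing by zero.

```python
def geometric_terms(total, ratio, n=10):
    """Return n terms a, a*x, ..., a*x**(n-1) that sum to `total` (x = ratio)."""
    if ratio == 1:
        first = total / n  # degenerate case: all terms equal
    else:
        first = total * (ratio - 1) / (ratio ** n - 1)
    return [first * ratio ** k for k in range(n)]

terms = geometric_terms(2.8, 0.5)
print(terms)
print(sum(terms))
```

Any ratio between 0 and 1 gives a decaying curve; a smaller ratio makes the decay steeper.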
Pascal's rule for counting the subsets of a set works great when the set contains unique entities.
Is there a modification to this rule for when the set contains duplicate items?
For instance, when I try to find the count of the combinations of the letters A,B,C,D, it's easy to see that it's 1 + 4 + 6 + 4 + 1 (from Pascal's Triangle) = 16, or 15 if I remove the "use none of the letters" entry.
Now, what if the set of letters is A,B,B,B,C,C,D? Computing by hand, I can determine that the sum of subsets is: 1 + 4 + 8 + 11 + 11 + 8 + 4 + 1 = 48, but this doesn't conform to the Triangle I know.
Question: How do you modify Pascal's Triangle to take into account duplicate entities in the set?
It looks like you want to know how many sub-multi-sets have, say, 3 elements. The math for this gets very tricky, very quickly. The idea is that you want to add together all of the combinations of ways to get there. So you have C(3,4) = 4 ways of doing it with no duplicated elements. B can be repeated twice in C(1,3) = 3 ways. B can be repeated 3 times in 1 way. And C can be repeated twice in C(1,3) = 3 ways. For 11 total. (Your 10 you got by hand was wrong. Sorry.)
In general trying to do that logic is too hard. The simpler way to keep track of it is to write out a polynomial whose coefficients have the terms you want which you multiply out. For Pascal's triangle this is easy, the polynomial is (1+x)^n. (You can use repeated squaring to calculate this more efficiently.) In your case if an element is repeated twice you would have a (1+x+x^2) factor. 3 times would be (1+x+x^2+x^3). So your specific problem would be solved as follows:
(1 + x) (1 + x + x^2 + x^3) (1 + x + x^2) (1 + x)
= (1 + 2x + 2x^2 + 2x^3 + x^4)(1 + 2x + 2x^2 + x^3)
= 1 + 2x + 2x^2 + x^3 +
2x + 4x^2 + 4x^3 + 2x^4 +
2x^2 + 4x^3 + 4x^4 + 2x^5 +
2x^3 + 4x^4 + 4x^5 + 2x^6 +
x^4 + 2x^5 + 2x^6 + x^7
= 1 + 4x + 8x^2 + 11x^3 + 11x^4 + 8x^5 + 4x^6 + x^7
If you want to produce those numbers in code, I would use the polynomial trick to organize your thinking and code. (You'd be working with arrays of coefficients.)
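A minimal sketch of the polynomial trick with coefficient arrays follows. The helper names `poly_mul` and `submultiset_counts` are invented for this example; index k of the result holds the number of sub-multisets of size k.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def submultiset_counts(multiplicities):
    """Coefficients of the product of (1 + x + ... + x^m) over all multiplicities m."""
    result = [1]
    for m in multiplicities:
        result = poly_mul(result, [1] * (m + 1))
    return result

counts = submultiset_counts([1, 3, 2, 1])  # A once, B three times, C twice, D once
print(counts)       # [1, 4, 8, 11, 11, 8, 4, 1]
print(sum(counts))  # 48
```

With all multiplicities equal to 1 this reduces to a row of Pascal's triangle.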
A set only contains unique items. If there are duplicates, then it is no longer a set.
Yes, if you don't want to consider sets, consider the idea of 'factors.' How many factors does:
p1^a1.p2^a2....pn^an
have if the pi's are distinct primes? If the ai's are all 1, then the number is 2^n. In general, the answer is (a1+1)(a2+1)...(an+1) as David Nehme notes.
Oh, and note that your answer by hand was wrong, it should be 48, or 47 if you don't want to count the empty set.
You don't need to modify Pascal's Triangle at all. Study C(k,n) and you'll find out -- you basically need to divide the original results to account for the permutation of equivalent letters.
E.g., A B1 B2 C1 D1 == A B2 B1 C1 D1, therefore you need to divide C(5,5) by C(2,2).
Without duplicates (in a set as earlier posters have noted), each element is either in or out of the subset. So you have 2^n subsets. With duplicates (in a "multi-set"), you have to take into account the number of times each element is in the "sub-multi-set". If m_1, m_2, ..., m_n represent the number of times each element repeats, then the number of sub-bags is (1+m_1) * (1+m_2) * ... * (1+m_n).
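The product formula is easy to check in code. This sketch assumes Python 3.8 or later (for `math.prod`); `count_sub_multisets` is a hypothetical name.

```python
from collections import Counter
from math import prod

def count_sub_multisets(items):
    """Number of sub-multisets (including the empty one): product of (1 + m_i)."""
    return prod(m + 1 for m in Counter(items).values())

print(count_sub_multisets("ABCD"))     # 16
print(count_sub_multisets("ABBBCCD"))  # 48
```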
Even though mathematical sets do contain unique items, you can run into the problem of duplicate items in 'sets' in the real world of programming. See this thread on Lisp unions for an example.