How to choose the lengths of my sub sequences for a shell sort? - shellsort

Let's assume we have a sequence a_i of length n and we want to sort it using shell sort. To do so, we would choose sub sequences of out a_i's of length k_i.
I'm now wondering how to choose those k_i's. You usually see that if n=16 we would choose k_1=8, k_2=4, k_3=2, k_4=1. So we would pair-wise compare the number's for each k_i and at the end use insertionSort to finish our sorting.
The idea of first sorting sub sequences of length k_i is to "pre-sort" the sequence for the insertionSort. Right?
Questions:
Now, depending on how we choose our k_i, we get a better performance. Is there a rule I can use here to choose the k_i's?
Could I also choose e.g. n=15, k_1=5, k_2=3, k_3=2?
If we have n=10 and k_1=5, would we now go with {k_2=2, k_3=1} or {k_2=3, k_2=2, k_3=1} or {k_2=3, k_3=1}?

The fascinating thing about shellsort is that for a sequence of n (unique) entries a unique set of gaps will be required to sort it efficiently, essentially f(n) => {gap/gaps}
For example, to most efficiently - on average - sort a sequence containing
2-5 entries - use insertion sort
6 entries - use shellsort with gaps {4,1}
7 or 8 entries - use a {5,1} shellsort
9 entries - use a {6,1} shellsort
10 entries - use a {9,6,1} shellsort
11 entries - use a {10,6,1} shellsort
12 entries - use a {5,1} shellsort
As you can see, 6-9 require 2 gaps, 10 and 11 three and 12 two. This is typical of shellsort's gaps: from one n to the next (i e n+1) you can be fairly sure that the number and makeup of gaps will differ.
A nasty side-effect of shellsort is that when using a set of random combinations of n entries (to save processing/evaluation time) to test gaps you may end up with either the best gaps for n entries or the best gaps for your set of combinations - most likely the latter.
I speculate that it is probably possible to create algorithms where you can plug in an arbitrary n and get the best gap sequence computed for you. Many high-profile computer scientists have explored the relationship between n and gaps without a lot to show for it. In the end they produce gaps (more or less by trial and error) that they claim perform better than those of others who have explored shellsort.
Concerning your foreword given n=16 ...
a {8,4,2,1} shellsort may or may not be an efficient way to sort 16 entries.
or should it be three gaps and, if so, what might they be?
or even two?
Then, to (try to) answer your questions ...
Q1: a rule can probably be formulated
Q2: you could ... but you should test it (for a given n there are n! possible sequences to test)
Q3: you can compare it with the correct answer (above). Or you can test it against all 10! possible sequences when n=10 (comes out to 3628800 of them) - doable

Related

Is it possible to represent 'average value' in programming?

Had a tough time thinking of an appropriate title, but I'm just trying to code something that can auto compute the following simple math problem:
The average value of a,b,c is 25. The average value of b,c is 23. What is the value of 'a'?
For us humans we can easily compute that the value of 'a' is 29, without the need to know b and c. But I'm not sure if this is possible in programming, where we code a function that takes in the average values of 'a,b,c' and 'b,c' and outputs 'a' automatically.
Yes, it is possible to do this. The reason for this is that you can model the sort of problem being described here as a system of linear equations. For example, when you say that the average of a, b, and c is 25, then you're saying that
a / 3 + b / 3 + c / 3 = 25.
Adding in the constraint that the average of b and c is 23 gives the equation
b / 2 + c / 2 = 23.
More generally, any constraint of the form "the average of the variables x1, x2, ..., xn is M" can be written as
x1 / n + x2 / n + ... + xn / n = M.
Once you have all of these constraints written out, solving for the value of a particular variable - or determining that many solutions exists - reduces to solving a system of linear equations. There are a number of techniques to do this, with Gaussian elimination with backpropagation being a particularly common way to do this (though often you'd just hand this to MATLAB or a linear algebra package and have it do the work for you.)
There's no guarantee in general that given a collection of equations the computer can determine whether or not they have a solution or to deduce a value of a variable, but this happens to be one of the nice cases where the shape of the contraints make the problem amenable to exact solutions.
Alright I have figured some things out. To answer the question as per title directly, it's possible to represent average value in programming. 1 possible way is to create a list of map data structures which store the set collection as key (eg. "a,b,c"), while the average value of the set will be the value (eg. 25).
Extract the key and split its string by comma, store into list, then multiply the average value by the size of list to get the total (eg. 25x3 and 23x2). With this, no semantic information will be lost.
As for the context to which I asked this question, the more proper description to the problem is "Given a set of average values of different combinations of variables, is it possible to find the value of each variable?" The answer to this is open. I can't figure it out, but below is an attempt in describing the logic flow if one were to code it out:
Match the lists (from Paragraph 2) against one another in all possible combinations to check if a list contains all elements in another list. If so, substract the lists (eg. abc-bc) as well as the value (eg. 75-46). If upon substracting we only have 1 variable in the collection, then we have found the value for this variable.
If there's still more than 1 variables left such as abcd - bc = ad, then store the values as a map data structure and repeat the process, till the point where the substraction count in the full iteration is 0 for all possible combinations (eg. ac can't substract bc). This is unfortunately not where it ends.
Further solutions may be found by combining the lists (eg. ac + bd = abcd) to get more possible ways to subtract and derive at the answer. When this is the case, you just don't know when to stop trying, and the list of combinations will get exponential. Maybe someone with strong related mathematical theories may be able to prove that upon a certain number of iteration, further additions are useless and hence should stop. Heck, it may even be possible that negative values are also helpful, and hence contradict what I said earlier about 'ac' can't subtract 'bd' (to get a,c,-b,-d). This will give even more combinations to compute.
People with stronger computing science foundations may try what templatetypedef has suggested.

Counting to a million in Python - Theory

I'm learning Python and came across a question that went something like "How long would it take to count to 1,000,000 out loud?" The only parameter it gave was, "you count, on average, 1 digit per second." I did that problem, which wasn't very difficult. Then I started thinking about counting aloud, annunciating each numeral. That parameter seems off to me, and indeed the answer Google gives to the question alone "how long to count to a million" suggests it's off. Given that each number in the sequence takes progressively longer (an exponential increase??), there must be a better way.
Any ideas or general guidance would be of assistance. Would sampling various people's "counting rates" at various intervals work? Would programming the # of syllables work? I am really curious, and have looked all over SO and Google for solutions that don't revolve around that seemingly inaccurate "average time".
Thanks, and sorry if this isn't on topic or in the appropriate place. I'm a long time lurker, but new to posting, so let me know if you need more info or anything. Thanks!
Let us suppose for the sake of simplicity that you don't say 1502 as "fifteen hundred and two", but as "thousand five hundred and two". Then we can hierarchically break it down.
And let's ignore the fact whether you say "and" or not (though apparently it is more said than not) for now. I will use this reference (and British English, because I like it more and it's more consistent : http://forum.wordreference.com/showthread.php?t=15&langid=6) for how to pronounce numbers.
In fact, to formally describe this, let t be a function of a set of numbers, that tells you how much time it takes to pronounce every number in that set. Then your question is how to compute t([1..1000000]), and we will use M=t([1..999])
Triplet time in function of previous one
To read a large number we start at the left and read the three-digit groups. The group at the left, of course, may have only one or two digits.
Thus for every number x of thousands you will say x thousand y where y will describe all the numbers from 1 to 999.
Thus the time you spend in the x thousand ... is 1000 t({1000x}) + M, as detailed here after :
Note that this formula is generalizable to numbers below 1000, by simply defining t({0}) = 0.
Now the time to say "x thousand" is, per our hypothesis, equal to the time to say "x" plus the time to say "thousand" (when x > 0). Thus your answer is :
Where is the time it takes to say the word thousand. This supposes you say 1000 as "one thousand". You may want to remove 1000 tau("one") if you would only say "thousand".
How ever I stick with the reference :
The numbers 100-199 begin with one hundred... or a hundred...
You can in exactly the same way express the time it takes to count to a billion from and the number above, and so on for all the greater powers of 103, i.e.
Taking into account the "and"
There is a small correction to be done. Let us suppose that M is the time it takes to pronounce numbers from 1 to 999 when they are preceded by at least a non-0 group of numbers, including initial "and"s.
Our reference (well, the wordreference post I linked) says the following :
What do we say to join the groups?
Normally, we don’t use any joining word.
The exception is the last group.
If the last group after the thousands is 1-99 it is joined with and.
Thus our correction applies only to the numbers between 0 and 999 (where there is no non-zero group preceding) :
Getting M
Or rather, let's get t([1..999]) since it's more natural and we know how it is related to M.
Let C = t([1..99]), X = t([1..9]).
Between 1 and 999 we have all the numbers from [1..99] and all the 9 exact hundreds where you don't say "and", that is 108 occurences. There are 900 numbers prefixed with a hundreds number.
Thus
C is probably hard to break down, so I'm not going to try.
Final result
The corrected formula is :
And as a function of C and X :
Note that your measures of tau(word), C, and X need to be very precise if you plan on doing this multiplication and having any kind of correct order of magnitude.
Conclusion : Brits end up saying "and" a whole lot. The nice thing about the last formulation is that you can remove all the "and"s if you decide you actually don't want to pronounce them.

How many different partitions with exactly n parts can be made of a set with k-elements?

How many different partitions with exactly two parts can be made of the set {1,2,3,4}?
There are 4 elements in this list that need to be partitioned into 2 parts. I wrote these out and got a total of 7 different possibilities:
{{1},{2,3,4}}
{{2},{1,3,4}}
{{3},{1,2,4}}
{{4},{1,2,3}}
{{1,2},{3,4}}
{{1,3},{2,4}}
{{1,4},{2,3}}
Now I must answer the same question for the set {1,2,3,...,100}.
There are 100 elements in this list that need to be partitioned into 2 parts. I know the largest size a part of the partition can be is 50 (that's 100/2) and the smallest is 1 (so one part has 1 number and the other part has 99). How can I determine how many different possibilities there are for partitions of two parts without writing out extraneous lists of every possible combination?
Can the answer be simplified into a factorial (such as 12!)?
Is there a general formula one can use to find how many different partitions with exactly n parts can be made of a set with k-elements?
1) stackoverflow is about programming. Your question belongs to https://math.stackexchange.com/ realm.
2) There are 2n subsets of a set of n elements (because each of n elements may either be or be not contained in the specific subset). This gives us 2n-1 different partitions of a n-element set into the two subsets. One of these partitions is the trivial one (with the one part being an empty subset and other part being the entire original set), and from your example it seems you don't want to count the trivial partition. So the answer is 2n-1-1 (which gives 23-1=7 for n=4).
The general answer for n parts and k elements would be the Stirling number of the second kind S(k,n).
Please beware that the usual convention is with n the total number of elements, thus S(n,k)
Computing the general formula is quite ugly, but doable for k=2 (with the common notation) :
Thus S(n,2) = 1/2 ( (+1) * 1 * 0n +(-1) * 2 * 1n + (+1) * 1 * 2n ) = (0-2+2n)/2 = 2n-1-1

Quantifying the non-randomness of a specialized random generator?

I just read this interesting question about a random number generator that never generates the same value three consecutive times. This clearly makes the random number generator different from a standard uniform random number generator, but I'm not sure how to quantitatively describe how this generator differs from a generator that didn't have this property.
Suppose that you handed me two random number generators, R and S, where R is a true random number generator and S is a true random number generator that has been modified to never produce the same value three consecutive times. If you didn't tell me which one was R or S, the only way I can think of to detect this would be to run the generators until one of them produced the same value three consecutive times.
My question is - is there a better algorithm for telling the two generators apart? Does the restriction of not producing the same number three times somehow affect the observable behavior of the generator in a way other than preventing three of the same value from coming up in a row?
As a consequence of Rice's Theorem, there is no way to tell which is which.
Proof: Let L be the output of the normal RNG. Let L' be L, but with all sequences of length >= 3 removed. Some TMs recognize L', but some do not. Therefore, by Rice's theorem, determining if a TM accepts L' is not decidable.
As others have noted, you may be able to make an assertion like "It has run for N steps without repeating three times", but you can never make the leap to "it will never repeat a digit three times." More appropriately, there exists at least one machine for which you can't determine whether or not it meets this criterion.
Caveat: if you had a truly random generator (e.g. nuclear decay), it is possible that Rice's theorem would not apply. My intuition is that the theorem still holds for these machines, but I've never heard it discussed.
EDIT: a secondary proof. Suppose P(X) determines with high probability whether or not X accepts L'. We can construct an (infinite number of) programs F like:
F(x): if x(F), then don't accept L'
else, accept L'
P cannot determine the behavior of F(P). Moreover, say P correctly predicts the behavior of G. We can construct:
F'(x): if x(F'), then don't accept L'
else, run G(x)
So for every good case, there must exist at least one bad case.
If S is defined by rejecting from R, then a sequence produced by S will be a subsequence of the sequence produced by R. For example, taking a simple random variable X with equal probability of being 1 or 0, you would have:
R = 0 1 1 0 0 0 1 0 1
S = 0 1 1 0 0 1 0 1
The only real way to differentiate these two is to look for streaks. If you are generating binary numbers, then streaks are incredibly common (so much so that one can almost always differentiate between a random 100 digit sequence and one that a student writes down trying to be random). If the numbers are taken from [0,1] uniformly, then streaks are far less common.
It's an easy exercise in probability to calculate the chance of three consecutive numbers being equal once you know the distribution, or even better, the expected number of numbers needed until the probability of three consecutive equal numbers is greater than p for your favourite choice of p.
Since you defined that they only differ with respect to that specific property there is no better algorithm to distinguish those two.
If you do triples of randum values of course the generator S will produce all other triples slightly more often than R in order to compensate the missing triples (X,X,X). But to get a significant result you'd need much more data than it will cost you to find any value three consecutive times the first time.
Probably use ENT ( http://fourmilab.ch/random/ )

math - function mapping set of integers

I need function that maps any m integers between a and b (where b-a > m) into integers between 0 to m-1. The m integers between a and b may not be in any order. The mapping could be in any order as long as it is one-to-one mapping.
For example I have a set of integers between 10 and 50 and I pick any 10 integers randomly and map them into 0-9. The function could take one, two or three inputs that may different for each set of those 10 integers. And one more thing, it has to be reversible, i.e using those inputs I can get back the original number.
does it exist of such function and is it possible ?
It's fairly easy. Map the smallest number to 0, the second smallest to 1, etc. The map is invertible if and only if you know the set of numbers you began with.
It sounds like you're asking for a minimal perfect hash. Such functions do exist, there are algorithms for finding them, and even preexisting libraries to do the work.

Resources