Increment number stored as array of digit-counters - math

I'm trying to store a counter that can become very large (well over 32 and probably 64-bit limits), but rather than use a single integer, I'd like to store it as an array of counters for each digit. This should be pretty language-agnostic.
In this form, 0 would be [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] (one zero, none of the other digits up to 9). 1 would be [0, 1, 0, ...] and so on. 10 would therefore be [1, 1, 0, ...].
I can't come up with a way to keep track of which digit counters should be decremented (moving from 29 to 30, for example) and how those changes should carry over. I suspect that it can't be done without another counter, either a single value representing the last cell touched, or an array of 10 more counters to flag when each digit should be touched.
Is it possible to represent a number in this fashion and count up without using a simple integer value?

No, this representation by itself would be useless because it fails to encode digit position, leading to many numbers having the same representation (e.g. 121 and 211).
Either use a bignum library, or 80 bits' worth of raw binary (that being sufficient to store your declared range of 10e23).
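To make the ambiguity concrete, here is a small sketch (mine, not part of the answer above) that maps an integer to its digit-count array and shows that 121 and 211 collide:
def digit_counts(n):
    # count how many of each decimal digit appears in n; position is discarded
    counts = [0] * 10
    for ch in str(n):
        counts[int(ch)] += 1
    return counts

print(digit_counts(121))  # [0, 2, 1, 0, 0, 0, 0, 0, 0, 0]
print(digit_counts(211))  # [0, 2, 1, 0, 0, 0, 0, 0, 0, 0] -- identical, so the mapping is not invertible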

Related

How to partition UUID space into N equal-size partitions?

Take a UUID in its hex representation: '123e4567-e89b-12d3-a456-426655440000'
I have a lot of such UUIDs, and I want to separate them into N buckets, where N is of my choosing, and I want to generate the bounds of these buckets.
I can trivially create 16 buckets with these bounds:
00000000-0000-0000-0000-000000000000
10000000-0000-0000-0000-000000000000
20000000-0000-0000-0000-000000000000
30000000-0000-0000-0000-000000000000
...
e0000000-0000-0000-0000-000000000000
f0000000-0000-0000-0000-000000000000
ffffffff-ffff-ffff-ffff-ffffffffffff
just by iterating over the options for the first hex digit.
Suppose I want 50 equal-size buckets (equal in terms of the number of UUID possibilities contained within each bucket), or 2000 buckets, or N buckets.
How do I generate such bounds as a function of N?
Your UUIDs above are 32 hex digits in length. So that means you have 16^32 ≈ 3.4e38 possible UUIDs. A simple solution would be to use a big int library (or a method of your own) to store these very large values as actual numbers. Then, you can just divide the number of possible UUIDs by N (call that value k), giving you bucket bounds of 0, k, 2*k, ... (N-1)*k, UMAX.
This runs into a problem if N doesn't divide the number of possible UUIDs. Obviously, not every bucket will have the same number of UUIDs, but in this case, they won't even be evenly distributed. For example, if the number of possible UUIDs is 32, and you want 7 buckets, then k would be 4, so you would have buckets of size 4, 4, 4, 4, 4, 4, and 8. This probably isn't ideal. To fix this, you could instead make the bucket bounds at 0, (1*UMAX)/N, (2*UMAX)/N, ... ((N-1)*UMAX)/N, UMAX. Then, in the inconvenient case above, you would end up with bounds at 0, 4, 9, 13, 18, 22, 27, 32 -- giving bucket sizes of 4, 5, 4, 5, 4, 5, 5.
You will probably need a big int library or some other method of storing large integers in order to use this method. For comparison, an unsigned 64-bit integer (e.g. unsigned long long in typical C++ implementations) can only store values up to 2^64 - 1 ≈ 1.8e19.
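For illustration, here is a minimal sketch of the second (evenly distributed) scheme above in Python, whose built-in integers are arbitrary precision, so no separate big-int library is needed; the helper names are mine:
def format_uuid(value):
    h = format(value, '032x')  # 32 hex digits, zero-padded
    return '-'.join([h[:8], h[8:12], h[12:16], h[16:20], h[20:]])

def uuid_bucket_bounds(n_buckets):
    space = 16 ** 32  # number of possible UUID values (2^128)
    bounds = [(i * space) // n_buckets for i in range(n_buckets)]
    bounds.append(space - 1)  # ffffffff-ffff-ffff-ffff-ffffffffffff
    return [format_uuid(b) for b in bounds]

# uuid_bucket_bounds(16) reproduces the 16-bucket bounds listed in the question;
# bucket i covers the range [bounds[i], bounds[i+1]).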
If N is a power of 2, then the solution is obvious: you can split on bit boundaries as for 16 buckets in your question.
If N is not a power of 2, the buckets mathematically cannot all be of exactly equal size, so the question becomes how much inequality you are willing to tolerate in the name of efficiency.
As long as N<2^24 or so, the simplest thing to do is just allocate UUIDs based on the first 32 bits into N buckets each of size 2^32/N. That should be fast enough and equal enough for most applications, and if N needs to be larger than that allows, you could easily double the bits with a smallish penalty.
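A minimal sketch of that idea (assuming the dashed hex form from the question; the function name is mine):
def bucket_for(uuid_str, n_buckets):
    prefix = int(uuid_str.replace('-', '')[:8], 16)  # first 32 bits as an integer
    return (prefix * n_buckets) >> 32  # maps 0..2^32-1 onto bucket indices 0..n_buckets-1

# bucket_for('123e4567-e89b-12d3-a456-426655440000', 50) -> 3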

How do you count the number of results

Write a program that reads a series of numbers, ending with 0, and then tells you how
many numbers you have keyed in (other than the last 0). For example, if you keyed in
the numbers 5, -10, 50, 22, -945, 12, 0 it would output ‘You have entered 6 numbers.’.
I'm doing my homework and can't get this one to work.
What stumps me is that I understand adding the numbers to get the sum total, but what do I call the number of numbers ...
thanks
Python has a very simple string method that could be used here, str.count(). If the input is read as a single comma-separated string, you can count the commas to get the number of numbers (not including the final 0, which has no comma after it). An example of this in use would be
user_input = '5, -10, 50, 22, -945, 12, 0'
number_of_numbers = user_input.count(',')
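If the numbers are entered one at a time rather than as a single comma-separated line, a simple counter loop does the same job (a rough sketch, not part of the original answer):
count = 0
while True:
    number = int(input())  # read the next number
    if number == 0:        # the terminating 0 is not counted
        break
    count += 1
print(f'You have entered {count} numbers.')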

How to create an 8-bit comparator with four 2-bit comparators?

I am designing an 8-bit comparator in Xilinx ISE Project Navigator. My goal is to cascade four 2-bit comparators, as shown in the picture. The input is a 16-bit value, of which the upper 8 bits are number A and the lower 8 bits are number B (SW(15:8) -> A; SW(7:0) -> B). There are two button inputs, BTN0 and BTN1; I use BTN0 to feed the first comparator's EQ input with the value 1.
In ISim, the comparison works fine if the two numbers are equal, but gets weird when I try it with two different numbers. I am working from several sources and I'm a beginner at all this, so there could easily be a bug or error I didn't think about.
http://25.media.tumblr.com/4e443e33d84b43e80e4f595b0044ab86/tumblr_mjd7vttpuc1r65yueo1_1280.png
I am afraid the 2-bit comparator is not correct. For example, if A1 = 1, A0 = 0, B1 = 0, and B0 = 0, the output of the AND3B1 is 0, and the output of the AND4B1 will also be 0, so AG = 0.

How do you represent data records containing non-numerical features as vectors (mathematical, NOT c++ vector)?

Many data mining algorithms/strategies use vector representation of data records in order to simulate a spatial representation of the data (like support vector machines).
My trouble comes from how to represent non-numerical features within the dataset. My first thought was to 'alias' each possible value for a feature with a number from 1 to n (where n is the number of possible values for that feature).
While doing some research, I came across a suggestion that when dealing with features that have a small number of possible values, you should use a bit string of length n, where each bit represents a different value and only the bit corresponding to the value being stored is set. I can see how you could theoretically save memory with this method for features that have fewer possible values than the number of bits used to store an integer on your target system, but the dataset I'm working with has many different values for various features, so I don't think that solution will help me at all.
What are some of the accepted methods of representing these values in vectors and when is each strategy the best choice?
So there's a convention to do this. It's much easier to show by example than to explain.
Suppose you have collected, from your web analytics app, four metrics describing each visitor to a web site:
sex/gender
acquisition channel
forum participation level
account type
Each of these is a categorical variable (aka factor) rather than a continuous variable (e.g., total session time, or account age).
# column headers of raw data--all fields are categorical ('factors')
col_headers = ['sex', 'acquisition_channel', 'forum_participation_level', 'account_type']
# a single data row represents one user
row1 = ['M', 'organic_search', 'moderator', 'premium_subscriber']
# expand data matrix width-wise by adding new fields (columns) for each factor level:
input_fields = [ 'male', 'female', 'new', 'trusted', 'active_participant', 'moderator',
'direct_typein', 'organic_search', 'affiliate', 'premium_subscriber',
'regular_subscriber', 'unregistered_user' ]
# now, original 'row1' above, becomes (for input to ML algorithm, etc.)
warehoused_row1 = [1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
This transformation technique seems more sensible to me than keeping each variable as a single column. For instance, if you do the latter, then you have to reconcile the three types of acquisition channels with their numerical representation--i.e., if organic search is a "1" should affiliate be a 2 and direct_typein a 3, or vice versa?
Another significant advantage of this representation is that it is, despite the width expansion, a compact representation of the data. (In instances where the column expansion is substantial--i.e., one field is user state, which might mean 1 column becomes 50, a sparse matrix representation is obviously a good idea.)
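Here is a hypothetical helper (my own sketch, not the poster's) that performs the same width-wise expansion mechanically. It assumes the raw value 'M' has been normalized to 'male', and it orders the output columns factor by factor following col_headers, so the column order differs from the input_fields list above:
factor_levels = {
    'sex': ['male', 'female'],
    'acquisition_channel': ['direct_typein', 'organic_search', 'affiliate'],
    'forum_participation_level': ['new', 'trusted', 'active_participant', 'moderator'],
    'account_type': ['premium_subscriber', 'regular_subscriber', 'unregistered_user'],
}

def expand_row(row, col_headers, factor_levels):
    # one-hot encode: 1 where the row's value matches a factor level, else 0
    encoded = []
    for header, value in zip(col_headers, row):
        encoded.extend(1 if level == value else 0 for level in factor_levels[header])
    return encoded

# expand_row(['male', 'organic_search', 'moderator', 'premium_subscriber'], col_headers, factor_levels)
# -> [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0]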
For this type of work I use the numerical computation libraries NumPy and SciPy.
from the Python interactive prompt:
>>> # create two data rows, representing two unique visitors to a Web site:
>>> row1 = NP.array([0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0])
>>> row2 = NP.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
>>> row1.dtype
dtype('int64')
>>> row1.itemsize
8
>>> # these two data arrays can be converted from int/float to boolean, substantially
>>> # reducing their size w/ concomitant performance improvement
>>> row1 = NP.array(row1, dtype=bool)
>>> row2 = NP.array(row2, dtype=bool)
>>> row1.dtype
dtype('bool')
>>> row1.itemsize # compare with row1.itemsize = 8, above
1
>>> # element-wise comparison of two data vectors (two users) is straightforward:
>>> row1 == row2 # element-wise comparison
array([False, False, True, True, False, False, False, False, True, False, False, True], dtype=bool)
>>> NP.sum(row1==row2)
4
For similarity-based computation (e.g. k-Nearest Neighbors), there is a particular metric used for expanded data vectors comprised of categorical variables called the Tanimoto Coefficient. For the particular representation i have used here, the function would look like this:
def tanimoto_bool(A, B):
    AuB = NP.sum(A == B)
    numer = AuB
    denom = len(A) + len(B) - AuB
    return numer / float(denom)
>>> tanimoto_bool(row1, row2)
0.2
There is no "widely accepted answer" that I know of; it entirely depends on what you want.
The main idea behind your post is that the trivial memory representation of a state may be too memory intensive. For example, to store a value that can have at most four states, you will use an int (32 bits) but you could manage with only 2 bits, so 16 times less.
However, the cleverer (i.e., more compact) your representation of a vector, the longer it will take you to code/decode it from/to the trivial representation.
I did a project where I represented the state of a Connect-4 board with 2 doubles (64 bits each), where each double encoded the discs owned by one player. It was a vast improvement over storing the state as 42 integers! I could explore much farther by having a smaller memory footprint. This is typically what you want.
It is possible, through a clever understanding of Connect-4, to encode it with only one double! I tried it, and the program became so long that I reverted to using 2 doubles instead of one. The program spent most of its time in the code/decode functions. This is typically what you do not want.
Now, because you want an answer, here are some guidelines:
If you can store booleans with one byte only, then keep them as booleans (language/compiler dependent).
Concatenate all small features (between 3 and 256 possible values) into primitive types like int, double, long double, or whatever your language uses. Then write functions to code/decode, using bit-shift operators for speed if possible (a rough sketch follows the TL;DR below).
Keep features that can have "lots of" (more than 256) possible values as-is.
Of course, these are not absolutes. If you have a feature that can take exactly 2^15 values and another that can take 2^17 values, then concatenate them into a primitive type that is 32 bits in size.
TL;DR: There's a trade-off between memory consumption and speed. You need to adjust according to your problem.
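As a rough illustration of the "concatenate small features" guideline (my own sketch, not the poster's Connect-4 code), assume three features with 4, 8, and 16 possible values, packed into a single integer:
def encode(a, b, c):
    # a needs 2 bits, b needs 3 bits, c needs 4 bits -> 9 bits total in one integer
    return (a << 7) | (b << 4) | c

def decode(packed):
    a = (packed >> 7) & 0b11
    b = (packed >> 4) & 0b111
    c = packed & 0b1111
    return a, b, c

# encode(3, 5, 9) == 473, and decode(473) == (3, 5, 9)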

Getting Unique Number Combinations

Is it possible, without using exponentiation, to have a set of numbers that, when added together, always give unique sums?
I know it can be done with exponentiation (see first answer): The right way to manage user privileges (user hierarchy)
But I'm wondering if it's possible without exponentiation.
No, you can only use exponentiation, because the sum of the lower values has to be less than the new number for the sums to be unique: 1+2=3 < 4, 1+2+4=7 < 8.
[EDIT:]
This is a layman's explanation; of course there are other possibilities, but none as efficient as using powers of 2.
There's a chance it can be done without exponentiation (I'm no math expert), but not in any way that's more efficient than exponentiation. This is because the powers-of-2 scheme only takes one bit of storage space per possible value, and as an added plus you can use boolean operators to do useful stuff with the values.
If you restrict yourself to integers, the numbers have to grow at least as fast as an exponential function. If you find a function that grows faster (like, oh, maybe the Ackermann function) then the numbers produced by that will probably work too.
With floating-point numbers, you can keep adding unique irreducible roots of primes (sqrt(2), sqrt(3), sqrt(5), ...) and you will always get something unique, up until you hit the limits of floating-point precision. Not sure how many unique numbers you could squeeze out of it - maybe you should try it.
No. To see this directly, think about building up the set of basis values by considering at each step the smallest possible positive integer that could be included as the next value. The next number to add must be different from all possible sums of the numbers already in the set (including the empty sum, which is 0), and can't combine with any combination of numbers already present to produce a duplicate. So...
{} : all possible sums = {0}, smallest possible next = 1
{1} : all possible sums = {0, 1}, smallest possible next = 2
{1, 2} : all possible sums = {0, 1, 2, 3}, smallest possible next = 4
{1, 2, 4} : a.p.s. = {0, 1, 2, 3, 4, 5, 6, 7}, s.p.n. = 8
{1, 2, 4, 8} ...
And, of course, we're building up the binary powers. You could start with something other than {1, 2}, but look what happens, using the "smallest possible next" rule:
{1, 3} : a.p.s. = {0, 1, 3, 4}, s.p.n. = 5 (2 fails because 1 + 2 = 3, which is already there; 4 fails because it is already a possible sum, 1 + 3)
{1, 3, 5} : a.p.s. = {0, 1, 3, 4, 5, 6, 8, 9}, s.p.n. = 10 (every candidate from 2 through 9 either is already a possible sum or creates a duplicate when combined with one)
{1, 3, 5, 10} ...
This sequence grows at least as fast as the binary powers term by term, and strictly faster after the first term.
If you want a nice Project-Euler-style programming challenge, you could write a routine that takes a set of positive integers and determines the "smallest possible next" positive integer, under the "sums must be unique" constraint.
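For what it's worth, here is a rough Python sketch of that routine (mine, not part of the original answer). Given a set whose subset sums are already all distinct, it finds the smallest positive integer that can be added while keeping every subset sum unique, and it reproduces the values in the worked examples above:
def all_subset_sums(values):
    # the empty subset contributes 0
    sums = {0}
    for v in values:
        sums |= {s + v for s in sums}
    return sums

def smallest_possible_next(values):
    existing = all_subset_sums(values)
    candidate = 1
    while True:
        new_sums = {s + candidate for s in existing}  # sums of subsets that include the candidate
        if existing.isdisjoint(new_sums):             # no collision means every subset sum stays unique
            return candidate
        candidate += 1

# smallest_possible_next([1, 2])    -> 4
# smallest_possible_next([1, 3])    -> 5
# smallest_possible_next([1, 3, 5]) -> 10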
