I have a SQLite table (without row ID, but that's probably irrelevant, and without any indexes) where my rows contain the following data:
2 real values, one of which is the primary key
3 integers < 100
1 more field for integers, but currently always null
According to http://www.sqlite.org/datatype3.html, integer values can take 1, 2, 3, 4, 6 or 8 bytes according to their magnitude. Therefore I'd expect each row in my table to take up about 20 bytes. In reality, sqlite3_analyzer gives me for the table
Average payload per entry......................... 25.65
Maximum payload per entry......................... 26
which is somewhere in between the minimum value of 20 and the maximum of 32 (if all integers were stored with 4 bytes). Is it possible to give the DB a "hint" to use even smaller integer types wherever possible? Or how else can the discrepancy be explained? (I don't think it's indexes because there are none for this table.)
Similarly, on a previous table I had 2 real values + 2 small integers and each entry occupied slightly more than 24 bytes (which is also more than I would have expected).
Also, there is no way to store floats in single precision with SQLite, right?
The actual record format has one integer for the header size, one integer for each column describing that value's type, and then the data of the column values.
In this case, we have:
bytes
  1   header size
  6   column types
 16   two real values
  3   three small integers between 2 and 127
  0   NULL
 --
 26
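As a sanity check, the same arithmetic can be spelled out in a few lines of Python (just illustrative; the sizes assume each REAL takes 8 bytes and each small integer fits in 1 byte):

# Payload arithmetic for one row, following the breakdown above.
parts = {
    "header size varint": 1,
    "serial types for 6 columns": 6,
    "2 REAL values, 8 bytes each": 16,
    "3 small integers, 1 byte each": 3,
    "NULL column": 0,
}
print(sum(parts.values()))  # 26, matching the maximum payload per entry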
I have two unique numbers: the first is in the range 100000 - 999999 (fixed length of 6 digits [0-9]), the second in the range
1000000 - 9999999 (fixed length of 7 digits [0-9]). How can I encode/decode these numbers (they need to remain separate after decoding), using only uppercase letters [A-Z] and digits [0-9], with a fixed total length of 8 characters?
Example:
input -> num_1: 242404, num_2 : 1002000
encode -> AX3B O3XZ
decode -> 2424041002000
Is there any algorithm for this type of problem?
This is just a simple mapping from one set of values to another set of values. The procedure is always the same:
List all possible input and output values.
Find the index of the input.
Return the value of the output list at that index.
Note that it's often not necessary to make an actual list (i.e. loading all values into some data structure). You can typically compute the value for any index on-demand. This case is no different.
Imagine a list of all possible input pairs:
0 100'000, 1'000'000
1 100'000, 1'000'001
2 100'000, 1'000'002
...
K 100'000, 9'999'999
K+1 100'001, 1'000'000
K+2 100'001, 1'000'001
...
N-1 999'999, 9'999'998
N 999'999, 9'999'999
For any given pair (a, b), you can compute its index i in this list like so:
// Make a and b zero-based
a -= 100'000
b -= 1'000'000
i = a*9'000'000 + b  // there are 9'000'000 possible values of b
Convert i to base 36 (A-Z and 0-9 gives you 36 symbols), pad on the left with zeros as necessary1, and insert a space after the fourth digit.
encoded = addSpace(zeroPad(base36(i)))
To get back to the input pair:
Convert the 8-character base 36 string to base 10 (this is the index into the list, remember), then derive a and b from the index.
i = base10(removeSpace(encoded))
a = i/9'000'000 + 100'000 // integer division (i.e. ignore remainder)
b = i%9'000'000 + 1'000'000
Here is an implementation in Go: https://play.golang.org/p/KQu9Hcoz5UH
1 If you don't like the idea of zero padding you can also offset i at this point. The target set of values is plenty big enough; the possible indexes use only a fraction of all base 36 numbers of the required length.
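For illustration, here is a rough Python sketch of the same mapping (the link above shows a Go version); the stride 9'000'000 is simply the number of possible values of the second number:

# Index-based encode/decode sketch following the list and formulas above.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base36(i, width=8):
    s = ""
    while i:
        i, r = divmod(i, 36)
        s = DIGITS[r] + s
    return s.rjust(width, "0")   # zero-pad on the left; the largest possible index needs 9 digits

def encode(num1, num2):
    a, b = num1 - 100_000, num2 - 1_000_000   # make both zero-based
    i = a * 9_000_000 + b                     # index into the imaginary list
    s = to_base36(i)
    return s[:4] + " " + s[4:]                # space after the fourth digit

def decode(encoded):
    i = int(encoded.replace(" ", ""), 36)     # base 36 string back to the index
    a, b = divmod(i, 9_000_000)
    return a + 100_000, b + 1_000_000

print(encode(242404, 1002000))                # GCRW ZG7K
print(decode("GCRW ZG7K"))                    # (242404, 1002000)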
Consider
> data.frame(n=runif(6),m=1:6)
n m
1 0.44000000 1
2 0.12102262 2
3 0.95483015 3
4 0.35628753 4
5 0.55000000 5
6 0.50189420 6
where you want to form the smallest number of sets such that the sum of the numbers in each set is less than 1.
Here is an example trial to find the partitions, not necessarily an optimal way to find them (particularly with bigger sets).
For example, one partition is the set containing row 3 alone, because 0.95483015 < 1. Another partition is the set of rows 5 and 1, because 0.55 + 0.44 < 1. And the remaining numbers go into a third partition, such that
partition: 3
partition: 5,1
partition: 2,4,6
Now I have a big list of numbers like that which I need to split into the least number of such partitions, i.e. the least number of sets of decimal numbers.
Does there exist some R package to find partitions under an optimality criterion like the least number of partitions satisfying such a condition?
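As a rough illustration (not an R package, and not guaranteed to be optimal), the manual procedure above corresponds to a greedy first-fit-decreasing pass, sketched here in Python; it does reproduce the three sets in the example:

# Greedy first-fit-decreasing sketch of the manual partitioning above.
def partition(values, capacity=1.0):
    bins = []                                # each bin is one partition
    for v in sorted(values, reverse=True):   # place the largest values first
        for b in bins:
            if sum(b) + v < capacity:        # strict '<' as in the question
                b.append(v)
                break
        else:
            bins.append([v])                 # open a new partition
    return bins

n = [0.44, 0.12102262, 0.95483015, 0.35628753, 0.55, 0.50189420]
print(partition(n))  # [[0.95483015], [0.55, 0.44], [0.5018942, 0.35628753, 0.12102262]]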
How do I represent integer numbers, for example 23647, in two bytes, where one byte contains the last two digits (47) and the other contains the rest of the digits (236)?
There are several ways to do this.
One way is to use Binary Coded Decimal (BCD). This encodes the decimal digits, rather than the number as a whole, into binary. The packed form puts two decimal digits into a byte. However, your example value 23647 has five decimal digits and will not fit into two bytes in BCD. This method only fits values up to 9999.
Another way is to put each of your two parts in binary and place each part into a byte. You can do integer division by 100 to get the upper part, so in Python you could use
upperbyte = 23647 // 100
Then the lower part can be gotten by the modulus operation:
lowerbyte = 23647 % 100
Python will directly convert the results into binary and store them that way. You can do all this in one step in Python and many other languages:
upperbyte, lowerbyte = divmod(23647, 100)
You are guaranteed that the lowerbyte value fits, but if the given value is too large the upperbyte value may not actually fit into a byte. All this assumes that the value is positive, since negative values would complicate things.
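Just to round this out, here is a small sketch (names are only illustrative) of putting the two parts into actual bytes and reconstructing the value:

upperbyte, lowerbyte = divmod(23647, 100)  # 236, 47
packed = bytes([upperbyte, lowerbyte])     # raises ValueError if a part > 255
restored = packed[0] * 100 + packed[1]     # 236*100 + 47 = 23647
print(list(packed), restored)              # [236, 47] 23647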
(This following answer was for a previous version of the question, which was to fit a floating-point number like 36.47 into two bytes, one byte for the integer part and another byte for the fractional part.)
One way to do that is to "shift" the number so you consider those two bytes to be a single integer.
Take your value (36.47), multiply it by 256 (the number of values that fit into one byte), round it to the nearest integer, convert that to binary. The bottom 8 bits of that value are the "decimal numbers" and the next 8 bits are the "integer value." If there are any other bits still remaining, your number was too large and there is an overflow condition.
This assumes you want to handle only non-negative values. Handling negatives complicates things somewhat. The final result is only an approximation to your starting value, but that is the best you can do.
Doing those calculations on 36.47 gives the binary integer
10010001111000
So the "decimal byte" is 01111000 and the "integer byte" is 100100 or 00100100 when filled out to 8 bits. This represents the float number 36.46875 exactly and your desired value 36.47 approximately.
I'm programming my Arduino micro controller and I found some code for accepting accelerometer sensor data for later use. I can understand all but the following code. I'd like to have some intuition as to what is happening but after all my searching and reading I can't wrap my head around what is going on and truly understand.
I have taken a class in C++ and we did very little with bitwise operations or bit shifting or whatever you'd like to call it. Let me try to explain what I think I understand and you can correct me where it is needed.
So:
I think we are storing a value in x, pretty sure in fact.
It appears that the data in array "buff", slot number 1, is being cast to the integer datatype.
The value in slot 1 is being bit shifted 8 places to the left. (Does this point to buff slot 0?)
This new value is being compared to the data in buff slot 0, and if either bit is true then the bit in the data stored in x will also be true, so 0 and 1 = 1, 0 and 0 = 0, and 1 and 0 = 1 in the end stored value.
The code does this for all three axes: x, y, z, but I'm not sure why... I need help. I want full understanding before I progress.
//each axis reading comes in 10 bit resolution, ie 2 bytes.
// Least Significant Byte first!!
//thus we are converting both bytes in to one int
x = (((int)buff[1]) << 8) | buff[0];
y = (((int)buff[3]) << 8) | buff[2];
z = (((int)buff[5]) << 8) | buff[4];
This code is being used to convert the raw accelerometer data (in an array of 6 bytes) into three 10-bit integer values. As the comment says, the data is LSB first. That is:
buff[0] // least significant 8 bits of x data
buff[1] // most significant 2 bits of x data
buff[2] // least significant 8 bits of y data
buff[3] // most significant 2 bits of y data
buff[4] // least significant 8 bits of z data
buff[5] // most significant 2 bits of z data
It's using bitwise operators to put the two parts together into a single variable. The (int) typecasts are unnecessary and (IMHO) confusing. This simplified expression:
x = (buff[1] << 8) | buff[0];
Takes the data in buff[1], and shifts it left 8 bits, and then puts the 8 bits from buff[0] in the space so created. Let's label the 10 bits a through j for example's sake:
buff[0] = cdefghij
buff[1] = 000000ab
Then:
buff[1] << 8 = ab00000000
And:
buff[1] << 8 | buff[0] = abcdefghij
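If it helps, the same shift-and-OR can be tried with made-up example bytes; this little Python snippet (illustration only, not the Arduino code) prints the combined value:

buff0 = 0b11101010          # low 8 bits of a reading
buff1 = 0b00000010          # top 2 bits of the same reading
x = (buff1 << 8) | buff0    # shift the high byte up, OR in the low byte
print(bin(x), x)            # 0b1011101010 746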
The value in slot 1 is being bit shifted 8 places to the left. (Does this point to buff slot 0?)
Nah. Bitwise operators ain't pointer arithmetic, don't confuse the two. Shifting by N places to the left is (roughly) equivalent to multiplying by 2 to the Nth power (except some corner cases in C, but let's not talk about those yet).
This new value is being compared to the data in buff slot 0, and if either bit is true then the bit in the data stored in x will also be true
No. | is not the logical OR operator (that would be ||) but the bitwise OR one. All the code does is combine the two bytes in buff[0] and buff[1] into a single 2-byte integer, where buff[1] denotes the MSB of the number.
The device result is in 6 bytes and the bytes need to be rearranged into 3 integers (having values that can only take up 10 bits at most).
So the first two bytes look like this:
00: xxxx xxxx <- binary value
01: ???? ??xx
The ??? part isn't part of the result because the x parts comprise the 10 bits. I guess the hardware is built in such a way that the ??? part is all zero bits.
To get this into a single integer variable, we need all 8 of the low bits plus the upper-order 2 bits, shifted left by 8 positions so they don't interfere with the low order 8 bits. The bitwise OR (| - vertical bar) will join those two parts into a single integer that looks like this:
x: ???? ??xx xxxx xxxx <- binary value of a single 16 bit integer
Actually it doesn't matter how big the 'int' is (in bits) as the remaining bits (beyond that 16) will be zero in this case.
To expand on and clarify the reply by Carl Norum:
The (int) typecast is required because the source is a byte. The bit shift is performed on the source datatype before the result is saved into x. Therefore it must be cast to at least 16 bits (an int) in order to bit-shift 8 bits and retain all the data before the OR operation is executed and the result saved.
What the code is not telling you is whether this should be an unsigned int or whether there is a sign in the bit data. I'd expect negative values are possible with an accelerometer.
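If the readings are indeed signed (10-bit two's complement is common for accelerometers, but check the datasheet for the particular part), a sign-extension step would also be needed. A hypothetical sketch, shown in Python only to illustrate the bit manipulation:

def to_signed_10bit(x):
    # If the 10-bit sign bit (0x200) is set, map the value into -512..-1.
    return x - 1024 if x & 0x200 else x

print(to_signed_10bit(0b1111111111))  # -1
print(to_signed_10bit(5))             # 5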
I'm writing something that reads bytes (just a List<int>) from a remote random number generation source that is extremely slow. For that and my personal requirements, I want to retrieve as few bytes from the source as possible.
Now I am trying to implement a method whose signature looks like:
int getRandomInteger(int min, int max)
I have two theories how I can fetch bytes from my random source, and convert them to an integer.
Approach #1 is naïve. Fetch (max - min) / 256 bytes and add them up. It works, but it's going to fetch a lot of bytes from the slow random number generator source I have. For example, if I want to get a random integer between zero and a million, it's going to fetch almost 4000 bytes... that's unacceptable.
Approach #2 sounds ideal to me, but I'm unable to come up with the algorithm. It goes like this:
Let's take min: 0, max: 1000 as an example.
Calculate ceil(rangeSize / 256) which in this case is ceil(1000 / 256) = 4. Now fetch one (1) byte from the source.
Scale this one byte from the 0-255 range to 0-3 range (or 1-4) and let it determine which group we use. E.g. if the byte was 250, we would choose the 4th group (which represents the last 250 numbers, 750-1000 in our range).
Now fetch another byte and scale from 0-255 to 0-250 and let that determine the position within the group we have. So if this second byte is e.g. 120, then our final integer is 750 + 120 = 870.
In that scenario we only needed to fetch 2 bytes in total. However, it gets much more complex if our range is 0-1000000, since we'd need several "groups".
How do I implement something like this? I'm okay with Java/C#/JavaScript code or pseudo code.
I'd also like to keep the result from losing entropy/randomness, so I'm slightly worried about scaling integers.
Unfortunately your Approach #1 is broken. For example if min is 0 and max 510, you'd add 2 bytes. There is only one way to get a 0 result: both bytes zero. The chance of this is (1/256)^2. However there are many ways to get other values, say 100 = 100+0, 99+1, 98+2... So the chance of a 100 is much larger: 101(1/256)^2.
The more-or-less standard way to do what you want is to:
Let R = max - min + 1 -- the number of possible random output values
Let N = 2^k >= mR, m>=1 -- a power of 2 at least as big as some multiple of R that you choose.
loop
b = a random integer in 0..N-1 formed from k random bits
while b >= mR -- reject b values that would bias the output
return min + floor(b/m)
This is called the method of rejection. It throws away randomly selected binary numbers that would bias the output. If max - min + 1 happens to be a power of 2, then you'll have zero rejections.
If you have m=1 and max - min + 1 is just one more than a biggish power of 2, then rejections will be near half. In this case you'd definitely want a bigger m.
In general, bigger m values lead to fewer rejections, but of course they require slightly more bits per number. There is a probabilistically optimal algorithm to pick m.
Some of the other solutions presented here have problems, but I'm sorry right now I don't have time to comment. Maybe in a couple of days if there is interest.
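For illustration, here is a minimal Python sketch of the rejection loop above with m = 1 (get_random_byte is a hypothetical stand-in for the slow byte source):

import math, random

def get_random_integer(min_val, max_val, get_random_byte):
    r = max_val - min_val + 1            # number of possible outputs
    k = math.ceil(math.log2(r))          # bits needed (the m = 1 case)
    n_bytes = (k + 7) // 8               # bytes to fetch per attempt
    while True:
        b = 0
        for _ in range(n_bytes):
            b = (b << 8) | get_random_byte()
        b >>= n_bytes * 8 - k            # keep exactly k random bits
        if b < r:                        # reject values that would bias the output
            return min_val + b

print(get_random_integer(0, 1000000, lambda: random.getrandbits(8)))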
3 bytes (together) give you a random integer in the range 0..16777215. You can use 20 bits from this value to get the range 0..1048575 and throw away values > 1000000.
range 1 to r
first find 'a' such that 256^a >= r
get 'a' bytes into array A[]
num = 0
for i = 0 to len(A)-1
    num += A[i] << (8*i)
next
random number = num mod r
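The same byte-assembly idea as a quick Python sketch (read_byte is a hypothetical stand-in for the source); note that the final mod can slightly bias the result unless 256^a is an exact multiple of r:

import random

def random_in_range(r, read_byte):
    a = 1
    while 256 ** a < r:                  # find 'a' with 256^a >= r
        a += 1
    num = 0
    for i in range(a):
        num += read_byte() << (8 * i)    # little-endian accumulate of 'a' bytes
    return num % r                       # value in 0..r-1

print(random_in_range(1000, lambda: random.getrandbits(8)))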
Your random source gives you 8 random bits per call. For an integer in the range [min,max] you would need ceil(log2(max-min+1)) bits.
Assume that you can get random bytes from the source using some function:
bool RandomBuf(BYTE* pBuf , size_t nLen); // fill buffer with nLen random bytes
Now you can use the following function to generate a random value in a given range:
// --------------------------------------------------------------------------
// produce a uniformly-distributed integral value in range [nMin, nMax]
// T is char/BYTE/short/WORD/int/UINT/LONGLONG/ULONGLONG
template <class T> T RandU(T nMin, T nMax)
{
static_assert(std::numeric_limits<T>::is_integer, "RandU: integral type expected");
if (nMin>nMax)
std::swap(nMin, nMax);
if (0 == (T)(nMax-nMin+1)) // all range of type T
{
T nR;
return RandomBuf((BYTE*)&nR, sizeof(T)) ? *(T*)&nR : nMin;
}
ULONGLONG nRange = (ULONGLONG)nMax-(ULONGLONG)nMin+1 ; // number of discrete values
UINT nRangeBits= (UINT)ceil(log((double)nRange) / log(2.)); // bits for storing nRange discrete values
ULONGLONG nR ;
do
{
if (!RandomBuf((BYTE*)&nR, sizeof(nR)))
return nMin;
nR= nR>>((sizeof(nR)<<3) - nRangeBits); // keep nRangeBits random bits
}
while (nR >= nRange); // ensure value in range [0..nRange-1]
return nMin + (T)nR; // [nMin..nMax]
}
Since you are always getting a multiple of 8 bits, you can save the extra bits between calls (for example, you may need only 9 bits out of 16). It requires some bit manipulation, and it is up to you to decide if it is worth the effort.
You can save even more if you use 'half bits': let's assume that you want to generate numbers in the range [1..5]. You need log2(5) ≈ 2.32 bits for each random value. Using 32 random bits you can actually generate floor(32/2.32) = 13 random values in this range, though it requires some additional effort.
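Banking the leftover bits between calls could look roughly like this sketch (get_byte is again a hypothetical stand-in for the slow source):

import random

class BitBank:
    def __init__(self, get_byte):
        self.get_byte = get_byte
        self.bits = 0        # buffered random bits
        self.nbits = 0       # how many bits are currently buffered

    def take(self, k):
        # Return k random bits, fetching new bytes only when the buffer runs dry.
        while self.nbits < k:
            self.bits = (self.bits << 8) | self.get_byte()
            self.nbits += 8
        self.nbits -= k
        out = self.bits >> self.nbits          # hand out the top k bits
        self.bits &= (1 << self.nbits) - 1     # keep the rest for the next call
        return out

bank = BitBank(lambda: random.getrandbits(8))
print(bank.take(9), bank.take(9))  # two 9-bit values from only three bytes fetched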