How to extract 4 bit unsigned integer from 32 bit R integer? - r

I am reading binary files in R and need to read 10 bytes, which must be interpreted as 4 bit unsigned integers (2 per byte, so 20 values in range 0..15 I guess).
From my understanding of the docs, this cannot be done with readBin directly because the minimal length to read, 1, means 1 byte.
So I think I need to read the data as 1 byte integers and use bit-wise operations to get the 4 bit integers. I found out that the values are stored as 32 bit integers internally by R, and I found this explanation on SO that seems to describe what I want to do. So here is my attempt at an R function that follows the advice:
#' #title Interpret bits start_index to stop_index of input int8 as unsigned integer.
uint8bits <- function(int8, start_index, stop_index) {
  num_bits = stop_index - start_index + 1L;
  bitmask = bitwShiftL((bitwShiftL(1L, num_bits) -1L), stop_index);
  return(bitwShiftR(bitwAnd(int8, bitmask), start_index));
}
However, it does not work as intended, e.g., to get the two numbers out of the read value (255 in this example), I would call the function once to extract bits 1 to 4, and once more for bits 5 to 8:
value1 = uint8bits(255L, 1, 4); # I would expect 15, but the output is 120.
value2 = uint8bits(255L, 5, 8); # I would expect 15, but the output is 0.
What am I doing wrong?

We can use the packBits function to achieve your expected behaviour:
uint8.to.uint4 <- function(int8, start_index, stop_index)
{
  bits <- intToBits(int8)
  out <- packBits(c(bits[start_index:stop_index],
                    rep(as.raw(0), 32 - (stop_index - start_index + 1))),
                  type = "integer")
  return(out)
}
uint8.to.uint4(255L,1,4)
[1] 15
We first convert the integer to a bit vector, then extract the bits you want and pad with zero bits up to the 32-bit internal storage length R uses for integers. Then packBits converts the result back to an integer.
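For reference, the bit-mask approach from the question should also work once the shift amounts are zero-based (the question's bit positions are 1-based): shift the mask and the result by start_index - 1 rather than by stop_index and start_index. A minimal sketch:
uint8bits <- function(int8, start_index, stop_index) {
  num_bits <- stop_index - start_index + 1L
  bitmask  <- bitwShiftL(bitwShiftL(1L, num_bits) - 1L, start_index - 1L)
  bitwShiftR(bitwAnd(int8, bitmask), start_index - 1L)
}
uint8bits(255L, 1, 4)  # 15
uint8bits(255L, 5, 8)  # 15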

Related

How does the modulus shorten a number to only 16 digits in this example?

I am learning Solidity by following various tutorials online. One of these tutorials is Cryptozombies. In this tutorial we create a zombie that has "dna" (a number). These numbers must be only 16 digits long to work with the code. We do this by defining
uint dnaDigits = 16;
uint dnaModulus = 10 ** dnaDigits;
At some point in the lesson, we define a function that generates "random" dna by passing a string to the keccak256 hash function.
function _generateRandomDna(string memory _str) private view returns (uint) {
    uint rand = uint(keccak256(abi.encodePacked(_str)));
    return rand % dnaModulus;
}
I am trying to figure out how the output of keccak256 % 10^16 is always a 16 digit integer. I guess part of the problem is that I don't exactly understand what keccak256 outputs. What I (think) I know is that it outputs a 256 bit number. A bit is either 0 or 1, so there are 2^256 possible outputs from this function?
If you need more information please let me know. I think I included everything relevant from the code.
Whatever keccak256(abi.encodePacked(_str)) returns is converted into a uint because of the cast uint(...).
uint rand = uint(keccak256(abi.encodePacked(_str)));
Once it's a uint, it's simple math, because of the modulus:
x % 100 is always < 100
I see now that this code is meant to produce a number that has no more than 16 digits, not exactly 16 digits.
Take x % 10^16.
If x < 10^16, then the remainder is x, and x is by definition less than 10^16, which is the smallest possible number with 17 digits; i.e. every number less than 10^16 has 16 or fewer digits.
If x = 10^16, then the remainder is 0, which has fewer than 16 digits.
If x > 10^16, then either 10^16 divides x exactly or it doesn't. If it divides exactly, the remainder is zero, which has fewer than 16 digits; if not, the remainder is some value less than 10^16, which again has 16 or fewer digits.
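Another way to see it: x % 10^16 simply keeps the last 16 decimal digits of x (any leading zeros among those digits are dropped, which is why the result can end up shorter than 16 digits). For example:
123456789012345678901 % 10^16 = 6789012345678901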

How to make R display 19 digit number as it is stored?

I have a dataset with a key column which is basically a 19 digit integer.
I'm using tibbles so I use options(pillar.sigfig = 22) to display larger numbers and not scientific notation.
Problem is, I notice that the number stored in the column and the one that is displayed are slightly different; to be specific, the last 3 digits are different.
E.g.
options(pillar.sigfig = 22)
x <- 1099324498500011011
But when I try to return the number I get 1099324498500011008.
I'm not sure why R would change the last 3 digits and since it is a key, it makes my data unusable for analysis.
I have tried the usual options(scipen = 999) for suppressing scientific notation but it does not seem to work on tibbles.
How do I get the same 19 digit number as I intend to store it?
Sorry to be the bearer of bad news, but R only has
a numeric type (double) using 64 bits and approximately sixteen decimal digits of precision
an integer type (int) using 32 bits
There is nothing else. You may force the print function to show you nineteen digits but that just means ... you are looking at three digits of randomness.
19 digits for (countable) items are common, and often provided by (signed or unsigned) int64_t types, which R does not have natively but approximates via the integer64 type in the bit64 package.
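For instance, printing the double stored for the value in the question with full precision shows the rounding (a quick check in base R):
sprintf("%.0f", 1099324498500011011)
[1] "1099324498500011008"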
So the following may be your only workaround:
> suppressMessages(library(bit64))
> x <- as.integer64("123456790123456789")
> x
integer64
[1] 123456790123456789
> x - 1
integer64
[1] 123456790123456788
>
The good news is that integer64 is reasonably well supported by data.table and a number of other packages.
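Since data.table supports it, one workaround sketch (assuming the key arrives in a delimited text file; the file name and column name here are made up) is to let fread keep the column as integer64, or to read it as plain character if it is only ever used for matching:
suppressMessages(library(bit64))
library(data.table)
dt  <- fread("mydata.csv", integer64 = "integer64")               # 19-digit keys become integer64
dt2 <- fread("mydata.csv", colClasses = list(character = "key"))  # or keep the key as a string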
PS It really is 19 digits where it bites:
> as.integer64(1.2e18) + 1
integer64
[1] 1200000000000000001
> as.integer64(1.2e19) + 1
integer64
[1] <NA>
Warning message:
In as.integer64.double(1.2e+19) : NAs produced by integer64 overflow
>

a negative unsigned int?

I'm trying to wrap my head around the TrueType specification. On this page, in the section 'cmap' format 4, the parameter idDelta is listed as an unsigned 16-bit integer (UInt16). Yet, further down, a few examples are given, and here idDelta is given the values -9, -18, -27 and 1. How is this possible?
This is not a bug in the spec. The reason they show negative numbers in the idDelta row for the examples is that "All idDelta[i] arithmetic is modulo 65536" (quoted from the section just above). Here's how that works.
The formula to get the glyph index is
glyphIndex = idDelta[i] + c
where c is the character code. Since this expression must be modulo 65536, it is equivalent to the following expression if you were using integers larger than 2 bytes:
glyphIndex = (idDelta[i] + c) % 65536
idDelta is a u16, so let's say it had the max value 65535 (0xFFFF), then glyphIndex would be equal to c - 1 since:
0xFFFF + 2 = 0x10001
0x10001 % 0x10000 = 1
You can think of this as a 16-bit integer wrapping around to 0 when an overflow occurs.
Now remember that a modulo is repeated subtraction, keeping the remainder. Well in this case, since idDelta is only 16 bits, the modulo never needs to subtract more than once, because the maximum value you can get from adding two 16-bit integers is 0x1FFFE, which is smaller than 0x20000. That means a shortcut is to subtract 65536 (0x10000) instead of performing the modulo.
glyphIndex = (idDelta[i] - 0x10000) + c
And this is what the example shows as the values in the table. Here's an actual example from a .ttf file I've decoded:
I want the index for the character code 97 (lowercase 'a').
97 is greater than 32 and smaller than 126, so we use index 2 of the mappings.
idDelta[2] == 65507
glyphIndex = (65507 + 97) % 65536 === 68 which is the same as (65507 - 65536) + 97 === 68
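The negative values in the spec's table are just these unsigned 16-bit words read back as signed quantities (value - 65536 whenever the value is 0x8000 or above). A small sketch of that reinterpretation, in R to match the earlier examples on this page:
as_int16 <- function(u) ifelse(u >= 0x8000L, u - 0x10000L, u)  # reinterpret uint16 as int16
as_int16(65507)                       # -29, the idDelta from the example above
as_int16(c(65527, 65518, 65509, 1))   # -9 -18 -27 1, how such stored values would display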
The definition and use of idDelta on that page are not consistent. In the struct subheader it is defined as an int16, while a little earlier the same subheader is listed as UInt16*4.
It's probably a bug in the spec.
If you look at actual implementations, like this one from perl Tk, you'll see that idDelta is usually given as signed:
typedef struct SUBHEADER {
USHORT firstCode; /* First valid low byte for subHeader. */
USHORT entryCount; /* Number valid low bytes for subHeader. */
SHORT idDelta; /* Constant adder to get base glyph index. */
USHORT idRangeOffset; /* Byte offset from here to appropriate
* glyphIndexArray. */
} SUBHEADER;
Or see the implementation from libpdfxx:
struct SubHeader
{
    USHORT firstCode;
    USHORT entryCount;
    SHORT  idDelta;
    USHORT idRangeOffset;
};

bit-shift operation in accelerometer code

I'm programming my Arduino micro controller and I found some code for accepting accelerometer sensor data for later use. I can understand all but the following code. I'd like to have some intuition as to what is happening but after all my searching and reading I can't wrap my head around what is going on and truly understand.
I have taken a class in C++ and we did very little with bitwise operations or bit shifting or whatever you'd like to call it. Let me try to explain what I think I understand and you can correct me where it is needed.
So:
I think we are storing a value in x, pretty sure in fact.
It appears that the data in array "buff", slot number 1, is being set to the datatype of integer.
The value in slot 1 is being bit shifted 8 places to the left.(does this point to buff slot 0?)
This new value is being compared to the data in buff slot 0 and if either bits are true then the bit in the data stored in x will also be true so, 0 and 1 = 1, 0 and 0 = 0 and 1 and 0 = 1 in the end stored value.
The code does this for all three axes: x, y, z, but I'm not sure why... I need help. I want full understanding before I progress.
//each axis reading comes in 10 bit resolution, ie 2 bytes.
// Least Significant Byte first!!
//thus we are converting both bytes in to one int
x = (((int)buff[1]) << 8) | buff[0];
y = (((int)buff[3]) << 8) | buff[2];
z = (((int)buff[5]) << 8) | buff[4];
This code is being used to convert the raw accelerometer data (in an array of 6 bytes) into three 10-bit integer values. As the comment says, the data is LSB first. That is:
buff[0] // least significant 8 bits of x data
buff[1] // most significant 2 bits of x data
buff[2] // least significant 8 bits of y data
buff[3] // most significant 2 bits of y data
buff[4] // least significant 8 bits of z data
buff[5] // most significant 2 bits of z data
It's using bitwise operators to put the two parts together into a single variable. The (int) typecasts are unnecessary and (IMHO) confusing. This simplified expression:
x = (buff[1] << 8) | buff[0];
Takes the data in buff[1], and shifts it left 8 bits, and then puts the 8 bits from buff[0] in the space so created. Let's label the 10 bits a through j for example's sake:
buff[0] = cdefghij
buff[1] = 000000ab
Then:
buff[1] << 8 = ab00000000
And:
buff[1] << 8 | buff[0] = abcdefghij
The value in slot 1 is being bit shifted 8 places to the left.(does this point to buff slot 0?)
Nah. Bitwise operators ain't pointer arithmetic, don't confuse the two. Shifting by N places to the left is (roughly) equivalent to multiplying by 2 to the Nth power (except for some corner cases in C, but let's not talk about those yet).
This new value is being compared to the data in buff slot 0 and if either bits are true then the bit in the data stored in x will also be true
No. | is not the logical OR operator (that would be ||) but the bitwise OR one. All the code does is combine the two bytes in buff[0] and buff[1] into a single 2-byte integer, where buff[1] denotes the MSB of the number.
The device result is in 6 bytes and the bytes need to be rearranged into 3 integers (having values that can only take up 10 bits at most).
So the first two bytes look like this:
00: xxxx xxxx <- binary value
01: ???? ??xx
The ??? part isn't part of the result because the xxx part comprise the 10 bits. I guess the hardware is built in such a way that the ??? part is all zero bits.
To get this into a single integer variable, we need all 8 of the low bits plus the high-order 2 bits shifted left by 8 positions so they don't interfere with the low-order 8 bits. The bitwise OR (|, vertical bar) will join those two parts into a single integer that looks like this:
x: ???? ??xx xxxx xxxx <- binary value of a single 16 bit integer
Actually it doesn't matter how big the 'int' is (in bits) as the remaining bits (beyond that 16) will be zero in this case.
To expand and clarify the reply by Carl Norum:
The (int) typecast is required because the source is a byte. The bitshift is performed on the source datatype before the result is saved into X. Therefore it must be cast to at least 16 bits (an int) in order to bitshift 8 bits and retain all the data before the OR operation is executed and the result saved.
What the code is not telling you is whether this should be an unsigned int or whether there is a sign in the bit data. I'd expect negative values to be possible with an accelerometer.
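If the readings are indeed signed (two's complement in 10 bits), the combined value also has to be sign-extended before use. A minimal sketch of that step, shown in R for consistency with the rest of this page (the byte values are made up; the same arithmetic works in C on the Arduino):
buff0 <- 0xFFL                                   # hypothetical low byte
buff1 <- 0x03L                                   # hypothetical high byte (only 2 bits used)
raw10 <- bitwOr(bitwShiftL(buff1, 8L), buff0)    # combine: 0..1023, here 1023
x <- if (bitwAnd(raw10, 0x200L) != 0L) raw10 - 1024L else raw10  # sign-extend bit 9
x                                                # -1 when interpreted as two's complement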

How to efficiently convert a few bytes into an integer between a range?

I'm writing something that reads bytes (just a List<int>) from a remote random number generation source that is extremely slow. For that and my personal requirements, I want to retrieve as few bytes from the source as possible.
Now I am trying to implement a method which signature looks like:
int getRandomInteger(int min, int max)
I have two theories how I can fetch bytes from my random source, and convert them to an integer.
Approach #1 is naïve. Fetch (max - min) / 256 bytes and add them up. It works, but it's going to fetch a lot of bytes from the slow random number generator source I have. For example, if I want to get a random integer between zero and a million, it's going to fetch almost 4000 bytes... that's unacceptable.
Approach #2 sounds ideal to me, but I'm unable to come up with the algorithm. It goes like this:
Lets take min: 0, max: 1000 as an example.
Calculate ceil(rangeSize / 256) which in this case is ceil(1000 / 256) = 4. Now fetch one (1) byte from the source.
Scale this one byte from the 0-255 range to 0-3 range (or 1-4) and let it determine which group we use. E.g. if the byte was 250, we would choose the 4th group (which represents the last 250 numbers, 750-1000 in our range).
Now fetch another byte and scale from 0-255 to 0-250 and let that determine the position within the group we have. So if this second byte is e.g. 120, then our final integer is 750 + 120 = 870.
In that scenario we only needed to fetch 2 bytes in total. However, it's much more complex as if our range is 0-1000000 we need several "groups".
How do I implement something like this? I'm okay with Java/C#/JavaScript code or pseudo code.
I'd also like to keep the result from losing entropy/randomness, so I'm slightly wary of scaling integers.
Unfortunately your Approach #1 is broken. For example, if min is 0 and max is 510, you'd add 2 bytes. There is only one way to get a 0 result: both bytes zero. The chance of this is (1/256)^2. However, there are many ways to get other values, say 100 = 100+0, 99+1, 98+2... So the chance of a 100 is much larger: 101 × (1/256)^2.
The more-or-less standard way to do what you want is to:
Let R = max - min + 1 -- the number of possible random output values
Let N = 2^k >= mR, m>=1 -- a power of 2 at least as big as some multiple of R that you choose.
loop
    b = a random integer in 0..N-1 formed from k random bits
while b >= mR -- reject b values that would bias the output
return min + floor(b/m)
This is called the method of rejection. It throws away randomly selected binary numbers that would bias the output. If max-min+1 happens to be a power of 2, then you'll have zero rejections.
If you have m=1 and max-min+1 is just one more than a biggish power of 2, then rejections will be near half. In this case you'd definitely want bigger m.
In general, bigger m values lead to fewer rejections, but of course they require slightly more bits per number. There is a probabilistically optimal algorithm to pick m.
Some of the other solutions presented here have problems, but I'm sorry right now I don't have time to comment. Maybe in a couple of days if there is interest.
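Here is a sketch of that rejection loop with m = 1, written in R to keep one language across this page (the structure carries over directly to Java/C#/JavaScript); getRandomBytes is a made-up stand-in for the slow remote source:
getRandomBytes <- function(n) sample(0L:255L, n, replace = TRUE)  # placeholder for the real source

getRandomInteger <- function(min, max) {
  R <- max - min + 1                      # number of possible outputs
  k <- ceiling(log2(R))                   # random bits needed (m = 1)
  nbytes <- ceiling(k / 8)                # bytes to fetch per attempt
  repeat {
    b <- 0
    for (byte in getRandomBytes(nbytes)) b <- b * 256 + byte  # assemble the bytes into one number
    b <- b %% 2^k                         # keep only k bits
    if (b < R) return(min + b)            # accept; otherwise reject and try again
  }
}

getRandomInteger(0, 1000000)              # uses 3 bytes per attempt, roughly 95% acceptance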
3 bytes (together) give you a random integer in the range 0..16777215. You can use 20 bits from this value to get the range 0..1048575 and throw away values > 1000000.
range 1 to r
find the smallest 'a' such that 256^a >= r
get 'a' bytes from the source into array A[]
num = 0
for i = 0 to len(A)-1
    num += A[i] * 256^i
next
random number = num mod range
Your random source gives you 8 random bits per call. For an integer in the range [min,max] you would need ceil(log2(max-min+1)) bits.
Assume that you can get random bytes from the source using some function:
bool RandomBuf(BYTE* pBuf , size_t nLen); // fill buffer with nLen random bytes
Now you can use the following function to generate a random value in a given range:
// --------------------------------------------------------------------------
// produce a uniformly-distributed integral value in range [nMin, nMax]
// T is char/BYTE/short/WORD/int/UINT/LONGLONG/ULONGLONG
template <class T> T RandU(T nMin, T nMax)
{
    static_assert(std::numeric_limits<T>::is_integer, "RandU: integral type expected");
    if (nMin > nMax)
        std::swap(nMin, nMax);
    if (0 == (T)(nMax - nMin + 1))   // all range of type T
    {
        T nR;
        return RandomBuf((BYTE*)&nR, sizeof(T)) ? *(T*)&nR : nMin;
    }
    ULONGLONG nRange = (ULONGLONG)nMax - (ULONGLONG)nMin + 1;     // number of discrete values
    UINT nRangeBits = (UINT)ceil(log((double)nRange) / log(2.));  // bits for storing nRange discrete values
    ULONGLONG nR;
    do
    {
        if (!RandomBuf((BYTE*)&nR, sizeof(nR)))
            return nMin;
        nR = nR >> ((sizeof(nR) << 3) - nRangeBits);  // keep nRangeBits random bits
    }
    while (nR >= nRange);  // ensure value in range [0..nRange-1]
    return nMin + (T)nR;   // [nMin..nMax]
}
Since you are always getting a multiple of 8 bits, you can save extra bits between calls (for example you may need only 9 bits out of 16 bits). It requires some bit manipulation, and it is up to you to decide if it is worth the effort.
You can save even more if you use 'half bits': let's assume that you want to generate numbers in the range [1..5]. You'll need log2(5) = 2.32 bits for each random value. Using 32 random bits you can actually generate floor(32/2.32) = 13 random values in this range, though it requires some additional effort.
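A sketch of that 'half bits' idea, again in R for consistency: treat 32 random bits as one number and peel off base-5 digits, giving floor(32/log2(5)) = 13 values in 1..5 from only 4 bytes (ignoring the small bias that comes from 2^32 not being an exact power of 5):
bytes <- c(0x12L, 0x34L, 0x56L, 0x78L)   # four hypothetical random bytes
n <- sum(bytes * 256^(0:3))              # assemble into one number in 0..2^32-1
vals <- integer(13)
for (i in 1:13) {
  vals[i] <- (n %% 5) + 1                # peel off the next base-5 "digit" -> value in 1..5
  n <- n %/% 5
}
vals                                      # 13 values in 1..5 from just 4 bytes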
