ATTINY84: Weird problem with reversed byte order - Arduino

I am encoding 6 values (4x 3-bit + 1-bit) into a 16-bit integer and transferring them via serial to an ATTINY84, splitting them into 2 bytes. That all works fine until the point where I re-assemble the bytes into a 16-bit int.
Example:
I am sending the following binary state 0001110000001100, which translates to 7180 and gets split into a byte array of [12, 28].
I am putting that byte array into the EEPROM and read it on the next power cycle.
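The sender-side code isn't shown, but a hypothetical sketch of the split (lowByte/highByte are the standard Arduino helpers; the EEPROM.write calls are assumptions to match the read code below, and w is the 16-bit word 7180):

// Hypothetical sender/storage side: split w = 7180 (0x1C0C) into two bytes
byte d0 = lowByte(w);   // 0x0C = 12
byte d1 = highByte(w);  // 0x1C = 28
EEPROM.write(0, d0);
EEPROM.write(1, d1);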
After power cycle my serial debug output looks like this:
12
28
7180
Awesome. Looks all good and my code for that part is:
byte d0 = EEPROM.read(0);          // low byte
byte d1 = EEPROM.read(1);          // high byte
unsigned int w = d0 + (256 * d1);  // re-assemble: 12 + 256*28 = 7180
But now the weirdest thing happens. When I do a bit-by-bit read I am getting back:
0011000000111000
should be:
0001110000001100
via:
for (byte t = 0; t < 16; t++) {
  Serial.print(bitRead(w, t) ? "1" : "0");
}
The bit representation is completely reversed. How is that possible? Or am I missing something?
I also confirmed that when I extract the actual 3-bit locations to recover my original values 0..7, they are all off.
Any help would be appreciated.

So it looks like I fell into the little/big endian trap.
Basically, as Alain said in the comments, everything is correct; it's just a matter of representation.
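The short explanation: bitRead(w, 0) returns the least significant bit, so the loop above prints the number LSB-first. A minimal sketch of printing MSB-first instead:

// Print from the highest bit down for the usual MSB-first notation
for (int8_t t = 15; t >= 0; t--) {
  Serial.print(bitRead(w, t) ? "1" : "0");
}
// prints 0001110000001100 for w = 7180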
I also came up with the following method that can extract bits from a little-endian stored number as if it were in big-endian format:
/**
 * #bex - extract l bits at offset o (counted from the most
 * significant bit) of a d-bit wide value n.
 */
uint8_t bexd(uint16_t n, uint8_t o, uint8_t l, uint8_t d) {
  uint8_t v = 0;
  uint8_t ob = d - o;
  for (uint8_t b = ob; b > (ob - l); b--) v = (v << 1) | (0x0001 & (n >> (b - 1)));
  return v;
}
uint8_t bexw(uint16_t n, uint8_t o, uint8_t l) { return bexd(n, o, l, 16); }
uint8_t bexb(uint8_t n, uint8_t o, uint8_t l) { return bexd(n, o, l, 8); }
For example:
In big-endian notation the "second" value is stored in bits 3, 4, and 5, whereas in little-endian notation it is stored in bits 10, 11, and 12. The method above allows you to work with a "little endian" value as if it were a "big endian" value.
To extract the second value from 7180 (which prints LSB-first as 0011000000111000) just do:
byte v = bexw(7180, 3, 3); // 0b111
Serial.println(v); // prints 7
Hope that helps someone.

Related

How to bruteforce a lossy AND routine?

I'm wondering whether there are any standard approaches to reversing AND routines by brute force.
For example I have the following transformation:
MOV(eax, 0x5b3e0be0)  # Move 0x5b3e0be0 into EAX.
MOV(edx, eax)         # Copy 0x5b3e0be0 into EDX as well.
SHL(edx, 0x7)         # Shift 0x5b3e0be0 left by 0x7, which results in 0x9f05f000.
AND(edx, 0x9d2c5680)  # AND 0x9f05f000 with 0x9d2c5680, which results in 0x9d045000.
XOR(edx, eax)         # XOR 0x9d045000 with the original value 0x5b3e0be0, which results in 0xc63a5be0.
My question is: how do I brute-force and reverse this routine (i.e. transform 0xc63a5be0 back into 0x5b3e0be0)?
One idea I had (which didn't work) was this PeachPy implementation:
# Input values
MOV(esi, 0xffffffff)  # Initial value to AND with; will be decreased by 1 in a loop.
MOV(cl, 0x1)          # Initial value to SHR with; will be increased by 1 until 0x1f.
MOV(eax, 0xc63a5be0)  # Target result which I'm looking to reproduce with the loop below.
MOV(edx, 0x5b3e0be0)  # Input value which will be transformed.
sub_esi = peachpy.x86_64.Label()
with loop:
    # End the loop if ESI = 0x0.
    TEST(esi, esi)
    JZ(loop.end)
    # Run the routine and check if it matches the end result.
    MOV(ebx, eax)
    SHR(ebx, cl)
    TEST(ebx, ebx)
    JZ(sub_esi)
    AND(ebx, esi)
    XOR(ebx, eax)
    CMP(ebx, edx)
    JZ(loop.end)
    # Add to the CL register which is used for SHR.
    # Also check if we've reached the last potential value of CL, which is 0x1f.
    ADD(cl, 0x1)
    CMP(cl, 0x1f)
    JNZ(loop.begin)
    # Decrement ESI by 1, reset CL, and restart the routine.
    peachpy.x86_64.LABEL(sub_esi)
    SUB(esi, 0x1)
    MOV(cl, 0x1)
    JMP(loop.begin)
# The ESI result here will either be 0x0 or a valid value to AND with to get the necessary result.
RETURN(esi)
Can you maybe recommend an article or a book specific to this?
It's not lossy: the final operation is an XOR.
The whole routine can be modeled in C as:

#define K 0x9d2c5680

uint32_t hash(uint32_t num)
{
    return num ^ ( (num << 7) & K);
}
Now, if we have two bits x and y and the operation x XOR y, when y is zero the result is x.
So given two numbers n1 and n2 and considering their XOR, the bits of n1 that pair with a zero in n2 make it into the result unchanged (the others are flipped).
So, considering num ^ ( (num << 7) & K), we can identify num with n1 and (num << 7) & K with n2.
Since n2 is the result of an AND, we can tell that it must have at least the same zero bits that K has.
This means that each bit of num that corresponds to a zero bit in the constant K makes it unchanged into the result.
Thus, by extracting those bits from the result we already have a partial inverse function:
/*hash & ~K extracts the bits of hash that pair with a zero bit in K*/
partial_num = hash & ~K
Technically, the factor num << 7 also introduces other zeros into the result of the AND: we know for sure that its lowest 7 bits must be zero.
However, K already has its lowest 7 bits zero, so we cannot exploit this extra information.
So we will just use K here, but if its value were different you'd need to take the AND into account (which, in practice, means zeroing the lower 7 bits of K).
This leaves us with 13 bits unknown (the ones corresponding to the bits that are set in K).
If we forget about the AND for a moment, we would have x ^ (x << 7), meaning that

h_i = num_i                 for i from 0 to 6 inclusive
h_i = num_i ^ num_(i-7)     for i from 7 to 31 inclusive

(The first line is due to the fact that the lower 7 bits of the right-hand side are zero.)
From this, starting from h_7 and going up, we can retrieve num_7 as h_7 ^ num_0 = h_7 ^ h_0.
From bit 14 onward the equality num_(i-7) = h_(i-7) no longer holds and we need to use num_k (for the suitable k), but luckily we have already computed its value in a previous step (that's why we work from the lower bits to the higher ones).
What the AND does to this is just restrict the values the index i runs over, specifically to only the bits that are set in K.
So to fill in the thirteen remaining bits one has to do:

part_num_7  = h_7  ^ part_num_0
part_num_9  = h_9  ^ part_num_2
part_num_12 = h_12 ^ part_num_5
...
part_num_31 = h_31 ^ part_num_24

Note that we exploited the fact that part_num_0..6 = h_0..6.
Here's a C program that inverts the function:
#include <stdio.h>
#include <stdint.h>

#define BIT(i, hash, result) ( (((result >> i) ^ (hash >> (i+7))) & 0x1) << (i+7) )
#define K 0x9d2c5680

uint32_t base_candidate(uint32_t hash)
{
    uint32_t result = hash & ~K;
    result |= BIT(0, hash, result);
    result |= BIT(2, hash, result);
    result |= BIT(3, hash, result);
    result |= BIT(5, hash, result);
    result |= BIT(7, hash, result);
    result |= BIT(11, hash, result);
    result |= BIT(12, hash, result);
    result |= BIT(14, hash, result);
    result |= BIT(17, hash, result);
    result |= BIT(19, hash, result);
    result |= BIT(20, hash, result);
    result |= BIT(21, hash, result);
    result |= BIT(24, hash, result);
    return result;
}

uint32_t hash(uint32_t num)
{
    return num ^ ( (num << 7) & K);
}

int main()
{
    uint32_t tester = 0x5b3e0be0;
    uint32_t candidate = base_candidate(hash(tester));
    printf("candidate: %x, tester %x\n", candidate, tester);
    return 0;
}
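As a side note, the constant 0x9d2c5680 is the tempering mask used by the Mersenne Twister, and this kind of xorshift-and-mask step can also be inverted by a short fixed-point iteration (a standard untempering trick, sketched here as an alternative to the bit-by-bit version above): the low 7 bits of the hash already equal those of num, and every pass recovers 7 more low-order bits, so 5 passes cover all 32 bits.

#include <stdio.h>
#include <stdint.h>

#define K 0x9d2c5680

// Invert h = num ^ ((num << 7) & K) by iterating the feedback:
// each pass fixes the next 7 bits, so ceil(32/7) = 5 passes suffice.
uint32_t unhash(uint32_t h)
{
    uint32_t num = h;
    for (int i = 0; i < 5; i++)
        num = h ^ ((num << 7) & K);
    return num;
}

int main()
{
    printf("%x\n", unhash(0xc63a5be0)); /* prints 5b3e0be0 */
    return 0;
}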
Since the original question was how to "bruteforce" rather than solve it, here's something that I eventually came up with which works just as well. Obviously it's prone to errors depending on the input (there might be more than one result).
from peachpy import *
from peachpy.x86_64 import *

input = 0xc63a5be0

x = Argument(uint32_t)
with Function("DotProduct", (x,), uint32_t) as asm_function:
    LOAD.ARGUMENT(edx, x)  # EDX = the input hash (0xc63a5be0)
    MOV(esi, 0xffffffff)
    with Loop() as loop:
        TEST(esi, esi)
        JZ(loop.end)
        MOV(eax, esi)
        SHL(eax, 0x7)
        AND(eax, 0x9d2c5680)
        XOR(eax, esi)
        CMP(eax, edx)
        JZ(loop.end)
        SUB(esi, 0x1)
        JMP(loop.begin)
    RETURN(esi)

# Read assembler return
abi = peachpy.x86_64.abi.detect()
encoded_function = asm_function.finalize(abi).encode()
python_function = encoded_function.load()
print(hex(python_function(input)))

What is the best practice for memory access in this N-body problem solved on AMD Radeon RX580?

I compute the trajectories of N particles which move in their mutual gravitational field. I wrote the following OpenCL kernel:
#define G 100.0f
#define EPS 1.0f

float2 f (float2 r_me, __constant float *m, __global float2 *r, size_t s, size_t n)
{
    size_t i;
    float2 res = (float2)(0.0f, 0.0f);

    for (i=1; i<n; i++) {
        size_t idx = i;
        // size_t idx = (i + s) % n;
        float2 dir = r[idx] - r_me;
        float dist = length (dir);
        res += G*m[idx]/pown(dist + EPS, 3) * dir;
    }

    return res;
}

__kernel void take_step_rk2 (__constant float *m,
                             __global float2 *r,
                             __global float2 *v,
                             float delta)
{
    size_t n = get_global_size(0);
    size_t s = get_global_id(0);

    float2 mv = f(r[s], m, r, s, n);
    float2 mr = v[s];

    float2 vpred1 = v[s] + mv * delta;
    float2 rpred1 = r[s] + mr * delta;

    float2 nv = f(rpred1, m, r, s, n);
    float2 nr = vpred1;

    barrier (CLK_GLOBAL_MEM_FENCE);
    r[s] += (mr + nr) * delta / 2;
    v[s] += (mv + nv) * delta / 2;
}
Then I run this kernel multiple times as a one-dimensional problem with global work size = [number of bodies]:
void take_step (struct cl_state *state)
{
    size_t n = state->nbodies;
    clEnqueueNDRangeKernel (state->queue, state->step, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish (state->queue);
}
This is a quote from the AMD OpenCL Optimization Guide (2015):
Under certain conditions, one unexpected case of a channel conflict is that reading from the same address is a conflict, even on the FastPath.
This does not happen on the read-only memories, such as constant buffers,
textures, or shader resource view (SRV); but it is possible on the read/write UAV
memory or OpenCL global memory.
All the work items in my kernel try to access the same memory in this loop, so there must be a channel conflict:
for (i=1; i<n; i++) {
    size_t idx = i;
    // size_t idx = (i + s) % n;
    float2 dir = r[idx] - r_me;
    float dist = length (dir);
    res += G*m[idx]/pown(dist + EPS, 3) * dir;
}
I replaced
size_t idx = i;
// size_t idx = (i + s) % n;
with
// size_t idx = i;
size_t idx = (i + s) % n;
so that the first work item (with global id 0) first accesses the first element in the array r, the second work item accesses the second element, and so on.
I expected this change to result in a performance improvement, but on the contrary it resulted in significant performance degradation (roughly by a factor of 2). What am I missing? Why is all-to-the-same-address memory access better in this situation?
If you have other tips on how to improve the performance, please share them. The OpenCL optimization guide is very confusing.
The f function's loop has no barrier to reconverge the work items for coalesced access. Once some items get their r data they start computing, while those that couldn't have to wait for their data and so lose coalescence. To re-group them, you could add a barrier, at least once per 10 iterations, or every 2 iterations, or maybe even every iteration. But accessing global memory has high latency, and barrier + latency is bad for performance. What you need here is local memory: it has low latency and a broadcasting ability, and it only loses coalescence on grains bigger than the local work-group size (64?), which is not bad for the global memory access either (you need to refill local memory from global every Kth iteration, where N is divided into K-sized groups).
A source from 2013 (
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf):
Thus, the key to effectively using the LDS is to control the access
pattern, so that accesses generated on the same cycle map to different
banks in the LDS. One notable exception is that accesses to the same
address (even though they have the same bits 6:2) can be broadcast to
all requestors and do not generate a bank conflict.
Using LDS (__local) for this will give good performance. Since LDS is small, you should do it in small patches, like 256 particles at a time.
Also, using i as idx is very cache friendly, but the modulus version is very cache hostile. Once data exists in the cache, it doesn't matter if N requests are made; they are served from the cache. But with the modulus you destroy the cache contents before they are re-used, depending on N. For small N it should be faster, as you foresaw. For big N, and with a small GPU cache, it is much worse: something like only 1 global request per cycle versus N-minus-cache_size global requests per cycle.
I guess with such a strong GPU you had a high N value, such as 64k bodies, which needs 2 variables per body and 4 bytes per variable, totaling 512 kB, which cannot fit in L1. Maybe only in L2, which is slower than idx = i served through L1.
Answer:
- all-to-the-same L1 cache address is faster than all-to-global/L2 cache address
- use local memory in a "blocking/patching" algorithm to achieve high speed
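For illustration, a minimal sketch of what the question's f function could look like with local-memory tiling (assumptions: TILE equals the work-group size and divides n evenly; the __local buffer is supplied via clSetKernelArg; the self term is not skipped because its dir is zero and contributes nothing):

#define G 100.0f
#define EPS 1.0f
#define TILE 64  // assumed work-group size; must divide n evenly

float2 f_tiled (float2 r_me, __constant float *m, __global float2 *r,
                __local float2 *r_tile, size_t n)
{
    float2 res = (float2)(0.0f, 0.0f);
    size_t lid = get_local_id(0);

    for (size_t tile = 0; tile < n; tile += TILE) {
        r_tile[lid] = r[tile + lid];    // each work item stages one body
        barrier (CLK_LOCAL_MEM_FENCE);  // tile fully loaded before use

        for (size_t j = 0; j < TILE; j++) {
            float2 dir = r_tile[j] - r_me;  // broadcast from LDS, no bank conflict
            float dist = length (dir);
            res += G*m[tile + j]/pown(dist + EPS, 3) * dir;
        }
        barrier (CLK_LOCAL_MEM_FENCE);  // done reading before the next overwrite
    }
    return res;
}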

How to get the first x leading binary digits of 5**x without big integer multiplication

How can I efficiently and elegantly compute, with perfect precision, the first x leading binary digits of 5**x?
For example, 5**20 is 10101101011110001110101111000101101011000110001. The first 8 leading binary digits are 10101101.
In my use case, x only ranges from 1 to 60. I don't want to create a table. A solution using 64-bit integers would be fine. I just don't want to use big integers.
first x leading binary digits of 5**x without big integer multiplication
efficiently and elegantly compute with perfect precision the first x leading binary digits of 5**x?
"compute with perfect precision" leaves out pow(). Too many implementations will return an imperfect result and FP math might not use 64 bit precision, even with long double.
Form an integer with a 64-bit whole number part .ms and a 64-bit fraction part .ls. Then loop 60 times, multiply by 5 and diving by 2 as needed, to keep the leading bits from growing too big.
Note there is some precision lost in the fraction, with N > 42, yet that is not significant enough to affect the whole number part OP is seeking.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t ms, ls;
} uint128;

// Simplifications possible here, leave for OP
uint128 times5(uint128 x) {
    uint128 y = x;
    for (int i = 1; i < 5; i++) {
        // y += x
        y.ms += x.ms;
        y.ls += x.ls;
        if (y.ls < x.ls) y.ms++;
    }
    return y;
}

uint128 div2(uint128 x) {
    x.ls = (x.ls >> 1) | (x.ms << 63);
    x.ms >>= 1;
    return x;
}

int main(void) {
    uint128 y = {.ms = 1};
    uint64_t pow2 = 2;
    for (unsigned x = 1; x <= 60; x++) {
        y = times5(y);
        while (y.ms >= pow2) {
            y = div2(y);
        }
        printf("%2u %16" PRIX64 ".%016" PRIX64 "\n", x, y.ms, y.ls);
        pow2 <<= 1;
    }
}
Output
whole part.fraction
1 1.4000000000000000
2 3.2000000000000000
3 7.D000000000000000
4 9.C400000000000000
...
57 14643E5AE44D12B.8F5FEE5AA432560D
58 32FA9BE33AC0AEC.E66FD3E29A7DD720
59 7F7285B812E1B50.401791B6823A99D0
60 9F4F2726179A224.501D762422C94044
^-------------^ This is the part OP is seeking.
The key to solving this task is divide and conquer: form an algorithm (which is simply *5 and /2 as needed), and code a type and functions to do each small step.
Is a loop of 60 efficient? Perhaps not. Another approach would use exponentiation by squaring. It would certainly be worth it for large N, yet for N == 60 a loop was simple enough for a quick turnaround.
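A possible sketch of that alternative, assuming a compiler with the non-standard unsigned __int128 type (GCC/Clang): keep a normalized 64-bit mantissa plus a binary exponent and square-and-multiply, taking the top 64 bits of each 128-bit product. The leading 8 bits OP wants stay exact; only the low mantissa bits suffer truncation.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t m; int e; } fx;  // value = m * 2^e, top bit of m set

static fx fx_mul(fx a, fx b) {
    unsigned __int128 p = (unsigned __int128)a.m * b.m;
    int e = a.e + b.e + 64;
    if (!(p >> 127)) { p <<= 1; e--; }     // renormalize: keep the top bit set
    return (fx){ (uint64_t)(p >> 64), e };
}

static fx fx_pow(fx base, unsigned n) {    // exponentiation by squaring
    fx r = { UINT64_C(1) << 63, -63 };     // 1.0
    while (n) {
        if (n & 1) r = fx_mul(r, base);
        base = fx_mul(base, base);
        n >>= 1;
    }
    return r;
}

int main(void) {
    fx five = { UINT64_C(5) << 61, -61 };  // 5.0, normalized
    for (unsigned x = 1; x <= 60; x++) {
        fx p = fx_pow(five, x);
        printf("%2u %016" PRIX64 "\n", x, p.m);  // leading 64 bits of 5**x
    }
}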
5^n = 2^(-n) * 10^n
Using this identity, we can easily compute the leading N base-2 digits of (the nearest integer to) any given power of 5.
This code example is in C, but it's the same idea in any other language.
Example output: https://wandbox.org/permlink/Fs205DDzQR0gaLSo
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>

#define STATIC_ASSERT(CONDITION) ((void)sizeof(int[(CONDITION) ? 1 : -1]))

uint64_t pow5_leading_digits(double power, uint8_t ndigits)
{
    STATIC_ASSERT(DBL_MANT_DIG <= 64);
    double pow5 = exp2(-power) * pow(10, power);
    const double binary_digits = ceil(log2(pow5));
    assert(ndigits <= DBL_MANT_DIG);
    if (!ndigits || binary_digits < 0)
        return 0;
    // If pow5 can fit in the number of digits requested, return it
    if (binary_digits <= ndigits)
        return pow5;
    // If pow5 is too big to return, divide by 2 until it fits
    if (binary_digits > DBL_MANT_DIG)
        pow5 /= exp2(binary_digits - DBL_MANT_DIG + 1);
    return (uint64_t)pow5 >> (DBL_MANT_DIG - ndigits);
}
Edit: now limits the returned value to those exactly representable with doubles.

Arduino: Formula to convert byte

I'm looking for a way to modify a binary byte value on an Arduino.
Because of the hardware, it's necessary to split a two-digit number into two 4-bit halves.
The code to set the output is wire.write(byte, 0xFF), which sets all outputs high.
0xFF = binary 1111 1111
The formula should convert a value like this:
e.g. the number 35 is binary 0010 0011,
but for my use it should be represented as 0011 0101, which would refer to 53 in reality.
The first 4 bits are for a BCD-input IC which displays the 5 from 35; the second 4 bits are for a BCD-input IC which displays the 3 from 35.
Does anybody have an idea how to convert this in code, or a mathematical formula?
Possible numbers are from 00 to 59.
Thank you for your help
To convert a value n between 0 and 99 to BCD:
((n / 10) * 16) + (n % 10)
assuming n is an integer and thus / is doing integer division; this also assumes the result will be stored in an unsigned byte.
(If this is not producing the desired result, please either explain how it is incorrect for the example given, or provide a different example for which it is incorrect.)
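As a quick check of that formula against the question's example, a minimal sketch in plain C (on the Arduino, the body of to_bcd can be used as-is):

#include <stdio.h>
#include <stdint.h>

// Pack a two-digit value into BCD: tens digit in the high nibble,
// units digit in the low nibble.
static uint8_t to_bcd(uint8_t n)
{
    return ((n / 10) * 16) + (n % 10);
}

int main(void)
{
    printf("%d\n", to_bcd(35)); // prints 53, i.e. binary 0011 0101
    return 0;
}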
int num = 35; // any number from 0 to 59
int tens = num / 10;
int units = num - (tens * 10);

// Build a binary string for the tens digit (most significant bit first)
String tensbinary = "";
for (int i = 3; i >= 0; i--) {
  tensbinary += ((tens >> i) & 1) ? "1" : "0";
}

// Repeat for the units digit
String unitsbinary = "";
for (int i = 3; i >= 0; i--) {
  unitsbinary += ((units >> i) & 1) ? "1" : "0";
}

// Now join the two together
String binary = tensbinary + unitsbinary; // "00110101" for 35
I don't know if this will work, but you might be able to extrapolate from it based on the available information.
The last thing you need to do is convert the string back into an actual binary value.
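A one-line sketch of that last step, assuming the Arduino String named binary built above (strtol with base 2 is available in avr-libc):

// Parse the binary digit string back into a byte, e.g. "00110101" -> 53
byte value = (byte) strtol(binary.c_str(), NULL, 2);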

OpenCL kernel question

I have taken the kernel from the great OpenCL SpMV article for AMD by Bryan Catanzaro.
I have given it a toy problem where the input is
A= [0 0 0 6 1 3 5 7 2 4 0 0]
offsets= [-3 0 2]
x= [1 2 3 4]
and the output y should be [7 22 15 34]
Here is the kernel:
__kernel
void dia_spmv(__global float *A, __const int rows,
              __const int diags, __global int *offsets,
              __global float *x, __global float *y) {
    int row = get_global_id(0);
    float accumulator = 0;
    for (int diag = 0; diag < diags; diag++) {
        int col = row + offsets[diag];
        if ((col >= 0) && (col < rows)) {
            float m = A[diag*rows + row];
            float v = x[col];
            accumulator += m * v;
        }
    }
    y[row] = accumulator;
}
After loading and writing the input arguments, I execute the kernel like this:
size_t global_work_size;
global_work_size = 4;
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);
err = clFinish(cmd_queue);
And I get the correct result when I read y back from gpu memory.
I.e. I get y = [7 22 15 34].
I am new to OpenCL (and GPGPU in general), so I want to understand how to extend the problem correctly to much larger matrices of arbitrary dimension.
So let's say I have 1,000,000 rows. What should I set global_work_size to be?
And should I set local_work_size, or should I leave it as NULL?
To use the kernel for arbitrary matrix sizes, you should think about the problem and rewrite the kernel. The issue is the limited memory size of the GPU and the limited size of a single buffer. You can get the maximum size for a buffer with clGetDeviceInfo and CL_DEVICE_MAX_MEM_ALLOC_SIZE.
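A minimal host-side sketch of that query (standard OpenCL API; the cl_device_id is assumed to be already set up):

#include <CL/cl.h>

// Query the largest single buffer allocation the device supports.
static cl_ulong max_buffer_size (cl_device_id device)
{
    cl_ulong max_alloc = 0;
    clGetDeviceInfo (device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                     sizeof (max_alloc), &max_alloc, NULL);
    return max_alloc;
}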
You need to split your problem into smaller pieces, calculate them separately, and merge the results afterwards.
I do not know the problem above and cannot give you any hints that would help you implement this; I can only give you the general direction.
