Hex offset past 512 bytes in a DNS response

I'm getting a response from a nameserver which is longer than 512 bytes. In that response are some offsets. An offset from the beginning of the response works fine, but when I get above 512 bytes the offset changes and it doesn't work anymore.
c0 0c = byte 12 from the start (works like a charm)
I have an offset c1 f0, which (to my knowledge so far) means:
c1 = 1 x 512 = 512
f0 = 240
c1 f0 = byte 240 from byte 512 == byte 752
My offset should point to the beginning of a name, which should therefore be located at byte 752, but at byte 752 the name isn't located.
Question
How does the offset work after 512 bytes?

It is a relative reference (a compression pointer). To mark it as a relative reference, the two highest bits of the first byte are reserved (set to 1), which leaves a maximum of 14 bits for the offset: 2 bytes with the highest 2 bits reserved. C0 01 is a reference to offset 1. The first byte therefore does not always have to be C0; it can also be C1, C2, C3, C4, CF, etc. In practice this will be fairly rare unless you have a very complex, long query/response, which is the case here: mine is 3000+ bytes :)
C1 = 11000001
strip the 2 highest bits: 000001
number = 1
offset of C1 F0 is 1 x 256 + 240 = 496
offset of C9 9F is 9 x 256 + 159 = 2463
One byte holds 256 combinations, not the 512 you used. :S
The maximum with C0 is C0 FF, which is offset 255; after that C1 00 starts.
Credit for this explanation goes to http://www.helpmij.nl/forum/member.php/215405-wampier
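In code, decoding such a pointer is just masking off the two flag bits and shifting. A minimal C sketch (the function name is mine; the test values are the offsets discussed above):

#include <stdint.h>
#include <stdio.h>

/* Returns the 14-bit offset encoded in a two-byte compression pointer. */
static unsigned dns_pointer_offset(uint8_t hi, uint8_t lo) {
    return ((unsigned)(hi & 0x3F) << 8) | lo;   /* drop the two '11' flag bits */
}

int main(void) {
    printf("%u\n", dns_pointer_offset(0xC0, 0x0C));  /* 12   */
    printf("%u\n", dns_pointer_offset(0xC1, 0xF0));  /* 496  */
    printf("%u\n", dns_pointer_offset(0xC9, 0x9F));  /* 2463 */
    return 0;
}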

Related

Why do XOR and subtraction with the same HEX values give the same result?

For example:
7A - 20 = 5A
7A XOR 20 = 5A
Of course, this will work the same using different values. Why does this occur exactly?
It's only the same if there are no borrows,
i.e. no 0 - 1 at any bit position; only 1 - 0 = 1, 1 - 1 = 0, or 0 - 0 = 0.
That's the same as saying that the first operand (the minuend) has set bits everywhere the second operand (subtrahend) does.
i.e. if x & y == y, then x-y == x^y.
The simplest counter-example:
0 - 1 = 0xFF (in an 8-bit byte): the borrow propagates all the way to the top of the register.
0 ^ 1 = 0x01 - XOR is add-without-carry, it just flips the bits in one operand where a bit is set in the other operand. (i.e. you could look at it as flipping no bits in 1, leaving 1. Or as flipping the low bit in 0, producing 1.)
XOR is commutative (x ^ y == y ^ x), subtraction is not (x-y is usually different from y-x, except for special-case results like 0 or 0x80)
Repeating XOR with the same value undoes it, e.g. 5A ^ 20 = 7A flips the bit back on.
But repeating subtraction doesn't: 5A - 20 = 3A.
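If you want to convince yourself, a brute-force check over all 8-bit pairs confirms the rule; this is just a throwaway C test harness, not part of the original answer:

#include <stdio.h>

int main(void) {
    /* whenever (x & y) == y, subtraction and XOR agree (no borrows possible) */
    for (unsigned x = 0; x < 256; x++)
        for (unsigned y = 0; y < 256; y++)
            if ((x & y) == y && ((x - y) & 0xFF) != (x ^ y))
                printf("counterexample: %02X %02X\n", x, y);   /* never prints */
    printf("%02X %02X\n", (0x7A - 0x20) & 0xFF, 0x7A ^ 0x20);  /* 5A 5A */
    return 0;
}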

assembly BYTE PTR clarification

I'm working through some practice problems, and I'm stuck on one particular thing:
What is the decimal value in BYTE PTR value+2?
.data
list DWORD 50 DUP(?)
matrix DWORD 20 DUP(5 DUP(?))
value DWORD 10597059 ; decimalstring
BYTE "Computer Architecture",0
The base address of list is 1000H.
I know the answer is 161, but I'm not sure how to get to that spot. Can anyone help explain that process? (there is extra data info in there from other questions using the same data set, fyi).
Thanks!
The variable value has the decimal value 10597059, which is the DWORD value 00A1B2C3 in hexadecimal. Because the x86 architecture is little-endian, this value is stored in memory in reverse byte order:
position:   0  1  2  3                    0  1  2  3
value:     00 A1 B2 C3   is stored as    C3 B2 A1 00
Here you can see that the single BYTE pointed to by BYTE PTR value+2 is A1 hexadecimal = 161 decimal: BYTE PTR value points to the BYTE at position 0, and BYTE PTR value+2 points to the BYTE at position 2 (relative to value), in the right-hand side above, which shows how the value lies in memory. A BYTE addressed this way is only one quarter of the DWORD value.
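The same thing can be demonstrated from C on a little-endian machine; this is just an illustrative sketch, not part of the assignment:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t value = 10597059;          /* 0x00A1B2C3 */
    uint8_t *p = (uint8_t *)&value;     /* byte view of the DWORD, like BYTE PTR value */
    printf("%u\n", p[2]);               /* prints 161 (0xA1) on a little-endian machine */
    return 0;
}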

What do these hex values mean?

I have a table of values of hex values (I think they are hex bytes?) and I'm trying to figure out what they mean in decimal.
On the website the author states that the highlighted values 43 7A 00 00 mean 250 in decimal, but when I input these into a hex to decimal converter I get 1132068864.
For the life of me I don't understand what's going on. I know that the headings above the highlighted values, 1 2 3 4 5 6 7 8 9 A B C D E F, are hex digits, but I don't understand how to read the values inside the table.
Help would be appreciated!
What's happening here is that the bytes 43 7A 00 00 are not being treated as an integer. They are being treated as an IEEE-format 32-bit floating-point number. This is why the Type column in the Inspector window in the image says Float. When those bytes are interpreted in that way they do indeed represent the value 250.0.
You can read about the details of the format at https://en.wikipedia.org/wiki/Single-precision_floating-point_format
In this particular case the bytes would be decoded as:
a "sign" bit of 0, meaning that the value is a positive number
an "exponent" field containing bits 1000 0110 (or hex 86, decimal 134), meaning that the exponent has the value 7 (calculated by subtracting 127 from the raw value of the field, 134)
a "significand" field containing bits 1111 1010 0000 0000 0000 0000 (or hex FA0000, decimal 16384000)
The significand and exponent are combined according to the formula:
value = ( significand * (2 ^ exponent) ) / (2 ^ 23)
where a ^ b means "a raised to the power b" .
In this case that gives:
value = ( 16384000 * 2^7 ) / 2^23
= ( 16384000 * 128 ) / 8388608
= 250.0
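If you want to reproduce that by hand, a small C sketch that extracts the fields and applies the formula above (variable names are mine) could look like this:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t bits = 0x437A0000;                   /* the bytes 43 7A 00 00, read big-endian */
    unsigned sign        = bits >> 31;            /* 0 -> positive */
    unsigned exp_field   = (bits >> 23) & 0xFF;   /* 0x86 = 134    */
    uint32_t significand = (bits & 0x7FFFFF) | 0x800000; /* fraction field plus implicit 1 -> 0xFA0000 */
    int exponent = (int)exp_field - 127;          /* 134 - 127 = 7 */
    /* formula from above; fine here because the exponent is positive */
    double value = (double)significand * (1 << exponent) / (1 << 23);
    printf("sign=%u exponent=%d value=%f\n", sign, exponent, value);  /* value=250.000000 */
    return 0;
}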

FMA instruction showing up as three packed double operations?

I'm analyzing a piece of linear algebra code which is calling intrinsics directly, e.g.
v_dot0 = _mm256_fmadd_pd( v_x0, v_y0, v_dot0 );
My test script computes the dot product of two double-precision vectors of length 4 (so only one call to _mm256_fmadd_pd is needed), repeated 1 billion times. When I count the number of operations with perf, I get the following:
Performance counter stats for './main':
0 r5380c7 (skl::FP_ARITH:512B_PACKED_SINGLE) (49.99%)
0 r5340c7 (skl::FP_ARITH:512B_PACKED_DOUBLE) (49.99%)
0 r5320c7 (skl::FP_ARITH:256B_PACKED_SINGLE) (49.99%)
2'998'943'659 r5310c7 (skl::FP_ARITH:256B_PACKED_DOUBLE) (50.01%)
0 r5308c7 (skl::FP_ARITH:128B_PACKED_SINGLE) (50.01%)
1'999'928'140 r5304c7 (skl::FP_ARITH:128B_PACKED_DOUBLE) (50.01%)
0 r5302c7 (skl::FP_ARITH:SCALAR_SINGLE) (50.01%)
1'000'352'249 r5301c7 (skl::FP_ARITH:SCALAR_DOUBLE) (49.99%)
I was surprised that the number of 256B_PACKED_DOUBLE operations is approx. 3 billion instead of 1 billion, since FMA is a single instruction in my architecture's instruction set. Why does perf count 3 packed-double operations per call to _mm256_fmadd_pd?
Note: to test that the code is not calling other floating point operations accidentally, I commented out the call to the above mentioned intrinsic, and perf counts exactly zero 256B_PACKED_DOUBLE operations, as expected.
Edit: MCVE, as requested:
ddot.c
#include <immintrin.h> // AVX
double ddot(int m, double *x, double *y) {
    int ii;
    double dot = 0.0;
    __m128d u_dot0, u_x0, u_y0, u_tmp;
    __m256d v_dot0, v_dot1, v_x0, v_x1, v_y0, v_y1, v_tmp;
    v_dot0 = _mm256_setzero_pd();
    v_dot1 = _mm256_setzero_pd();
    u_dot0 = _mm_setzero_pd();
    ii = 0;
    for (; ii < m - 3; ii += 4) {
        v_x0 = _mm256_loadu_pd(&x[ii + 0]);
        v_y0 = _mm256_loadu_pd(&y[ii + 0]);
        v_dot0 = _mm256_fmadd_pd(v_x0, v_y0, v_dot0);
    }
    // reduce
    v_dot0 = _mm256_add_pd(v_dot0, v_dot1);
    u_tmp = _mm_add_pd(_mm256_castpd256_pd128(v_dot0), _mm256_extractf128_pd(v_dot0, 0x1));
    u_tmp = _mm_hadd_pd(u_tmp, u_tmp);
    u_dot0 = _mm_add_sd(u_dot0, u_tmp);
    _mm_store_sd(&dot, u_dot0);
    return dot;
}
main.c:
#include <stdio.h>
double ddot(int, double *, double *);
int main(int argc, char const *argv[]) {
    double x[4] = {1.0, 2.0, 3.0, 4.0}, y[4] = {5.0, 5.0, 5.0, 5.0};
    double xTy;
    for (int i = 0; i < 1000000000; ++i) {
        ddot(4, x, y);
    }
    printf(" %f\n", xTy);
    return 0;
}
I run perf as
sudo perf stat -e r5380c7 -e r5340c7 -e r5320c7 -e r5310c7 -e r5308c7 -e r5304c7 -e r5302c7 -e r5301c7 ./a.out
The disassembly of ddot looks as follows:
0000000000000790 <ddot>:
790: 83 ff 03 cmp $0x3,%edi
793: 7e 6b jle 800 <ddot+0x70>
795: 8d 4f fc lea -0x4(%rdi),%ecx
798: c5 e9 57 d2 vxorpd %xmm2,%xmm2,%xmm2
79c: 31 c0 xor %eax,%eax
79e: c1 e9 02 shr $0x2,%ecx
7a1: 48 83 c1 01 add $0x1,%rcx
7a5: 48 c1 e1 05 shl $0x5,%rcx
7a9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
7b0: c5 f9 10 0c 06 vmovupd (%rsi,%rax,1),%xmm1
7b5: c5 f9 10 04 02 vmovupd (%rdx,%rax,1),%xmm0
7ba: c4 e3 75 18 4c 06 10 vinsertf128 $0x1,0x10(%rsi,%rax,1),%ymm1,%ymm1
7c1: 01
7c2: c4 e3 7d 18 44 02 10 vinsertf128 $0x1,0x10(%rdx,%rax,1),%ymm0,%ymm0
7c9: 01
7ca: 48 83 c0 20 add $0x20,%rax
7ce: 48 39 c1 cmp %rax,%rcx
7d1: c4 e2 f5 b8 d0 vfmadd231pd %ymm0,%ymm1,%ymm2
7d6: 75 d8 jne 7b0 <ddot+0x20>
7d8: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
7dc: c5 ed 58 d0 vaddpd %ymm0,%ymm2,%ymm2
7e0: c4 e3 7d 19 d0 01 vextractf128 $0x1,%ymm2,%xmm0
7e6: c5 f9 58 d2 vaddpd %xmm2,%xmm0,%xmm2
7ea: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
7ee: c5 e9 7c d2 vhaddpd %xmm2,%xmm2,%xmm2
7f2: c5 fb 58 d2 vaddsd %xmm2,%xmm0,%xmm2
7f6: c5 f9 28 c2 vmovapd %xmm2,%xmm0
7fa: c5 f8 77 vzeroupper
7fd: c3 retq
7fe: 66 90 xchg %ax,%ax
800: c5 e9 57 d2 vxorpd %xmm2,%xmm2,%xmm2
804: eb da jmp 7e0 <ddot+0x50>
806: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
80d: 00 00 00
I just tested with an asm loop on SKL. An FMA instruction like vfmadd231pd ymm0, ymm1, ymm3 counts for 2 counts of fp_arith_inst_retired.256b_packed_double, even though it's a single uop!
I guess Intel really wanted a FLOP counter, not an instruction or uop counter.
Your 3rd 256-bit FP uop is probably coming from something else you're doing, like a horizontal sum that starts out doing a 256-bit shuffle and another 256-bit add, instead of reducing to 128-bit first. I hope you're not using _mm256_hadd_pd!
Test code inner loop:
$ asm-link -d -n "testloop.asm" # assemble with NASM -felf64 and link with ld into a static binary
mov ebp, 100000000 ; setup stuff outside the loop
vzeroupper
0000000000401040 <_start.loop>:
401040: c4 e2 f5 b8 c3 vfmadd231pd ymm0,ymm1,ymm3
401045: c4 e2 f5 b8 e3 vfmadd231pd ymm4,ymm1,ymm3
40104a: ff cd dec ebp
40104c: 75 f2 jne 401040 <_start.loop>
$ taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,fp_arith_inst_retired.256b_packed_double -r4 ./"$t"
Performance counter stats for './testloop-cvtss2sd' (4 runs):
102.67 msec task-clock # 0.999 CPUs utilized ( +- 0.00% )
2 context-switches # 24.510 M/sec ( +- 20.00% )
0 cpu-migrations # 0.000 K/sec
2 page-faults # 22.059 M/sec ( +- 11.11% )
400,388,898 cycles # 3925381.355 GHz ( +- 0.00% )
100,050,708 branches # 980889291.667 M/sec ( +- 0.00% )
400,256,258 instructions # 1.00 insn per cycle ( +- 0.00% )
300,377,737 uops_issued.any # 2944879772.059 M/sec ( +- 0.00% )
300,389,230 uops_executed.thread # 2944992450.980 M/sec ( +- 0.00% )
400,000,000 fp_arith_inst_retired.256b_packed_double # 3921568627.451 M/sec
0.1028042 +- 0.0000170 seconds time elapsed ( +- 0.02% )
400M counts of fp_arith_inst_retired.256b_packed_double for 200M FMA instructions / 100M loop iterations.
(IDK what's up with perf 4.20.g8fe28c + kernel 4.20.3-arch1-1-ARCH. They calculate per-second stuff with the decimal point in the wrong place for the unit, e.g. 3925381.355 kHz is correct, not GHz. Not sure if it's a bug in perf or the kernel.)
Without vzeroupper, I'd sometimes see a latency of 5 cycles, not 4, for FMA. IDK if the kernel left a register in a polluted state or something.
Why do I get three though, and not two? (see MCVE added to original post)
Your ddot runs _mm256_add_pd(v_dot0, v_dot1); at the start of the cleanup, and since you call it with size=4, you get the cleanup once per FMA.
Note that your v_dot1 is always zero (because you didn't actually unroll with 2 accumulators like you're planning to?) So this is pointless, but the CPU doesn't know that. My guess was wrong, it's not a 256-bit hadd, it's just a useless 256-bit vertical add.
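To put numbers on it: per call you retire one 256-bit FMA (counted twice) plus that one 256-bit vaddpd (counted once), so
1'000'000'000 iterations x (2 + 1) = 3'000'000'000
which matches the ~2.999 billion 256B_PACKED_DOUBLE counts you measured. The two 128-bit adds in the reduction (the vaddpd xmm and the vhaddpd) similarly account for the ~2 billion 128B_PACKED_DOUBLE counts, and the final vaddsd for the ~1 billion SCALAR_DOUBLE counts.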
(For larger vectors, yes multiple accumulators are very valuable to hide FMA latency. You'll want at least 8 vectors. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about unrolling with multiple accumulators. But then you'll want a cleanup loop that does 1 vector at a time until you're down to the last up-to-3 elements.)
Also, I think your final _mm_add_sd(u_dot0, u_tmp); is actually a bug: you've already added the last pair of elements with an inefficient 128-bit hadd, so this double-counts the lowest element.
See Get sum of values stored in __m256d with SSE/AVX for a way that doesn't suck.
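For reference, a reduction along those lines (narrow to 128 bits first, then finish with in-lane adds) might look like this sketch; the helper name is mine, not code from that Q&A:

#include <immintrin.h>

static inline double hsum256_pd(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);        /* low half, no instruction needed */
    __m128d hi = _mm256_extractf128_pd(v, 1);      /* high half */
    lo = _mm_add_pd(lo, hi);                       /* {v0+v2, v1+v3} */
    __m128d high64 = _mm_unpackhi_pd(lo, lo);      /* move the upper element down */
    return _mm_cvtsd_f64(_mm_add_sd(lo, high64));  /* (v0+v2) + (v1+v3) */
}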
Also note that GCC is splitting your unaligned loads into 128-bit halves with vinsertf128 because you compiled with the default -mtune=generic (which favours Sandybridge) instead of using -march=haswell to enable AVX+FMA and set -mtune=haswell. (Or use -march=native)
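(For example, something along the lines of gcc -O3 -march=haswell ddot.c main.c; that exact invocation is just a guess at your build, but the -march flag is the part that matters.)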

Floating point data format sign+exponent

I am receiving data over UART from a heat meter, but I need some help to understand how I should deal with the data.
I have the documentation, but that is not enough for me; I have too little experience with this kind of calculation.
Maybe someone with the right skills could explain to me how it should be done, with a better example than the one I have from the documentation.
One value consists of the following bytes:
[number of bytes][sign+exponent] (integer)
(integer) is the register data value. The length of the integer value is
specified by [number of bytes]. [sign+exponent] is an 8-bit value that
specifies the sign of the data value and sign and value of the exponent. The
meaning of the individual bits in the [sign+exponent] byte is shown below:
Examples:
-123.45 = 04h, C2h, 0h, 0h, 30h, 39h
87654321*10^3 = 04h, 03h, 05h, 39h, 7Fh, B1h
255*10^3 = 01h, 03h, FFh
And now to one more example with actual data.
This is the information that I have from the documentation about this.
This is some data that I have received from my heat meter
10 00 56 25 04 42 00 00 1B E4
So in my example, 04 is the [number of bytes], 42 is the [sign+exponent], and 00 00 1B E4 is the (integer).
But I do not know how I should make the calculation to receive the actual value.
Any help?
Your data appears to be big-endian, according to your example. So here's how you break those bytes into the fields you need using bit shifting and masking.
n = b[0]
SI = (b[1] & 0x80) >> 7
SE = (b[1] & 0x40) >> 6
exponent = b[1] & 0x3f
integer = 0
for i = 0 to n-1:
    integer = (integer << 8) + b[2+i]
The sign of the mantissa is obtained from the MSb of the Sign+exponent byte, by masking (byte & 80h != 0 => SI = -1).
The sign of the exponent is similarly obtained by byte & 40h != 0 => SE = -1.
The exponent value is EXP = byte & 3Fh.
The mantissa INT is the binary number formed by the remaining bytes (four in your example), which can be read as a single integer (but mind the endianness).
Finally, compute SI * INT * pow(10, SE * EXP).
In your example, SI = 1, SE = -1, EXP = 2, INT = 7140, hence
1 * 7140 * pow(10, -1 * 2) = +71.4
It is not in the scope of this answer to explain how to implement this efficiently.
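That said, just to make the arithmetic concrete, a plain (not optimized) C sketch of the decode, following the field layout above (the function name is mine), might look like this:

#include <stdint.h>
#include <stdio.h>
#include <math.h>

static double decode_value(const uint8_t *b) {
    int n   = b[0];                         /* [number of bytes]          */
    int si  = (b[1] & 0x80) ? -1 : 1;       /* sign of the mantissa       */
    int se  = (b[1] & 0x40) ? -1 : 1;       /* sign of the exponent       */
    int exp = b[1] & 0x3F;                  /* exponent magnitude         */
    int64_t integer = 0;
    for (int i = 0; i < n; i++)             /* big-endian integer bytes   */
        integer = (integer << 8) | b[2 + i];
    return si * (double)integer * pow(10.0, se * exp);
}

int main(void) {
    const uint8_t record[] = {0x04, 0x42, 0x00, 0x00, 0x1B, 0xE4};
    printf("%f\n", decode_value(record));   /* prints 71.400000 */
    return 0;
}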
