get ecc public key from x and y components in PEM format using openssl - encryption

Can I get the ecc public key from x and y components in PEM format using openssl?
X:
1d 43 15 e3 84 99 d6 f6 9f 49 61 8a ae ec f2 4f
Y:
b5 1a 86 cf f9 0e 01 af 3a 9a 52 b3 c6 58 2c 48
thank you!!!!!!!!!!!

Yes it is possible. Here an example in C.
int main(void)
{
EC_GROUP *group;
EC_POINT *point;
EC_KEY *key;
BIGNUM *x, *y;
BIO *out;
ERR_load_crypto_strings();
OpenSSL_add_all_algorithms();
group = EC_GROUP_new_by_curve_name(NID_secp256k1);
x = BN_new();
y = BN_new();
BN_hex2bn(&x, "1d4315e38499d6f69f49618aaeecf24f");
BN_hex2bn(&y, "b51a86cff90e01af3a9a52b3c6582c48");
/* create EC point from X and Y */
point = EC_POINT_new(group);
EC_POINT_set_affine_coordinates_GFp(group, point, x, y, NULL);
/* Create a new EC key and set the public key */
key = EC_KEY_new();
EC_KEY_set_group(key, group);
EC_KEY_set_public_key(key, point);
out = BIO_new(BIO_s_file());
BIO_set_fp(out, stdout, BIO_NOCLOSE);
PEM_write_bio_EC_PUBKEY(out, key);
/* Clean up */
BN_free(x);
BN_free(y);
EC_POINT_free(point);
EC_GROUP_free(group);
EC_KEY_free(key);
BIO_free(out);
return 0;
}

Related

Z80 division algorithm not functioning properly

I am attempting to run the following code:
HLDIVC:
LD B,16
D0: XOR A
ADD HL,HL
RLA
CP C
JR C, D1
INC L
SUB C
DJNZ D0
D1: RET
It's an adaptation of the original code: (found here)
HL_Div_C:
;Inputs:
; HL is the numerator
; C is the denominator
;Outputs:
; A is the remainder
; B is 0
; C is not changed
; DE is not changed
; HL is the quotient
;
ld b,16
xor a
add hl,hl
rla
cp c
jr c,$+4
inc l
sub c
djnz $-7
ret
On an old pocket computer I have. I had to edit the code a little bit, namely because it would seem the assembler on this pocket computer simply does not support the "jr c,$+4" syntax and rather must use labels or absolute addresses. It would seem this may be causing issues, however, because the algorithm does not seem to be working properly. I am calling the function with the following code:
ORG O100H
LD HL,20
LD C,10
CALL REGOUT; Display all register values
CALL HLDIVC
CALL REGOUT
RET
With this I am trying to divide 20 by 10, so after calling the function the correct value in HL should be 2, and the value in A (The remainder) should be 0, from my understanding. This is not the case, however. Before running the HLDIVC program, these are the register values:
| PC = 0107 | AF = 00 44 |
| SP = 7FE8 | BC = 00 0A |
| IX = 7C06 | DE = 00 14 |
| IY = 7C0C | HL = 00 14 |
(All values are hexadecimal)
After running the program, these are the register values:
| PC = 010D | AF = 00 9B | <- A is correct
| SP = 7FE8 | BC = 10 0A | <- B is supposed to be 0
| IX = 7C06 | DE = 00 14 | <- DE is correct
| IY = 7C0C | HL = 00 14 | <- HL should be 2(?)
What's going on? Any help would be much appreciated, thank you for your time.
The problem with your code is that $+4 and $-7 are both referring to byte counts, not instruction counts, and the JR instruction is 2 bytes. The indentation gives you a clue. You need to move your labels:
HLDIVC:
LD B,16
XOR A
D0: ADD HL,HL
RLA
CP C
JR C, D1
INC L
SUB C
D1: DJNZ D0
RET

Qt: convert decimal to hex with 4 Bytes

I am using Qt to convert a decimal into a hex string
QString hexvalue = QString("%1").arg(decimal, 8, 16, QLatin1Char( '0' ));
I want to have
1: 00 00 00 01
-1: FF FF FF FF
This code however results in
FF FF FF FF FF FF FF FF and 00 00 00 01
How can I limit this to 4 Bytes?
You can use a mask with an AND operator to do this :
int val = -1;
qDebug() << QString("%1 : %2").arg(val).arg(val & 0xffffffff, 8, 16, QLatin1Char('0'));
will display
"-1 : ffffffff"
EDIT
As requested in comments, here is one way to have a variable length in the range 0 < length <= 8:
int mask = 0xffffffff >> (32 - 4 * length); // assuming a 32 bit integer
int val = -1;
qDebug() << QString("%1 : %2").arg(val).arg((unsigned int)(val & mask), length, 16, QLatin1Char('0'));

FMA instruction showing up as three packed double operations?

I'm analyzing a piece of linear algebra code which is calling intrinsics directly, e.g.
v_dot0 = _mm256_fmadd_pd( v_x0, v_y0, v_dot0 );
My test script computes the dot product of two double precision vectors of length 4 (so only one call to _mm256_fmadd_pd needed), repeated 1 billion times. When I count the number of operations with perf I get something as follows:
Performance counter stats for './main':
0 r5380c7 (skl::FP_ARITH:512B_PACKED_SINGLE) (49.99%)
0 r5340c7 (skl::FP_ARITH:512B_PACKED_DOUBLE) (49.99%)
0 r5320c7 (skl::FP_ARITH:256B_PACKED_SINGLE) (49.99%)
2'998'943'659 r5310c7 (skl::FP_ARITH:256B_PACKED_DOUBLE) (50.01%)
0 r5308c7 (skl::FP_ARITH:128B_PACKED_SINGLE) (50.01%)
1'999'928'140 r5304c7 (skl::FP_ARITH:128B_PACKED_DOUBLE) (50.01%)
0 r5302c7 (skl::FP_ARITH:SCALAR_SINGLE) (50.01%)
1'000'352'249 r5301c7 (skl::FP_ARITH:SCALAR_DOUBLE) (49.99%)
I was surprised that the number of 256B_PACKED_DOUBLE operations is approx. 3 billion, instead of 1 billion, as this is an instruction from my architecture's instruction set. Why does perf count 3 packed double operations per call to _mm256_fmadd_pd?
Note: to test that the code is not calling other floating point operations accidentally, I commented out the call to the above mentioned intrinsic, and perf counts exactly zero 256B_PACKED_DOUBLE operations, as expected.
Edit: MCVE, as requested:
ddot.c
#include <immintrin.h> // AVX
double ddot(int m, double *x, double *y) {
int ii;
double dot = 0.0;
__m128d u_dot0, u_x0, u_y0, u_tmp;
__m256d v_dot0, v_dot1, v_x0, v_x1, v_y0, v_y1, v_tmp;
v_dot0 = _mm256_setzero_pd();
v_dot1 = _mm256_setzero_pd();
u_dot0 = _mm_setzero_pd();
ii = 0;
for (; ii < m - 3; ii += 4) {
v_x0 = _mm256_loadu_pd(&x[ii + 0]);
v_y0 = _mm256_loadu_pd(&y[ii + 0]);
v_dot0 = _mm256_fmadd_pd(v_x0, v_y0, v_dot0);
}
// reduce
v_dot0 = _mm256_add_pd(v_dot0, v_dot1);
u_tmp = _mm_add_pd(_mm256_castpd256_pd128(v_dot0), _mm256_extractf128_pd(v_dot0, 0x1));
u_tmp = _mm_hadd_pd(u_tmp, u_tmp);
u_dot0 = _mm_add_sd(u_dot0, u_tmp);
_mm_store_sd(&dot, u_dot0);
return dot;
}
main.c:
#include <stdio.h>
double ddot(int, double *, double *);
int main(int argc, char const *argv[]) {
double x[4] = {1.0, 2.0, 3.0, 4.0}, y[4] = {5.0, 5.0, 5.0, 5.0};
double xTy;
for (int i = 0; i < 1000000000; ++i) {
ddot(4, x, y);
}
printf(" %f\n", xTy);
return 0;
}
I run perf as
sudo perf stat -e r5380c7 -e r5340c7 -e r5320c7 -e r5310c7 -e r5308c7 -e r5304c7 -e r5302c7 -e r5301c7 ./a.out
The disassembly of ddot looks as follows:
0000000000000790 <ddot>:
790: 83 ff 03 cmp $0x3,%edi
793: 7e 6b jle 800 <ddot+0x70>
795: 8d 4f fc lea -0x4(%rdi),%ecx
798: c5 e9 57 d2 vxorpd %xmm2,%xmm2,%xmm2
79c: 31 c0 xor %eax,%eax
79e: c1 e9 02 shr $0x2,%ecx
7a1: 48 83 c1 01 add $0x1,%rcx
7a5: 48 c1 e1 05 shl $0x5,%rcx
7a9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
7b0: c5 f9 10 0c 06 vmovupd (%rsi,%rax,1),%xmm1
7b5: c5 f9 10 04 02 vmovupd (%rdx,%rax,1),%xmm0
7ba: c4 e3 75 18 4c 06 10 vinsertf128 $0x1,0x10(%rsi,%rax,1),%ymm1,%ymm1
7c1: 01
7c2: c4 e3 7d 18 44 02 10 vinsertf128 $0x1,0x10(%rdx,%rax,1),%ymm0,%ymm0
7c9: 01
7ca: 48 83 c0 20 add $0x20,%rax
7ce: 48 39 c1 cmp %rax,%rcx
7d1: c4 e2 f5 b8 d0 vfmadd231pd %ymm0,%ymm1,%ymm2
7d6: 75 d8 jne 7b0 <ddot+0x20>
7d8: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
7dc: c5 ed 58 d0 vaddpd %ymm0,%ymm2,%ymm2
7e0: c4 e3 7d 19 d0 01 vextractf128 $0x1,%ymm2,%xmm0
7e6: c5 f9 58 d2 vaddpd %xmm2,%xmm0,%xmm2
7ea: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
7ee: c5 e9 7c d2 vhaddpd %xmm2,%xmm2,%xmm2
7f2: c5 fb 58 d2 vaddsd %xmm2,%xmm0,%xmm2
7f6: c5 f9 28 c2 vmovapd %xmm2,%xmm0
7fa: c5 f8 77 vzeroupper
7fd: c3 retq
7fe: 66 90 xchg %ax,%ax
800: c5 e9 57 d2 vxorpd %xmm2,%xmm2,%xmm2
804: eb da jmp 7e0 <ddot+0x50>
806: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
80d: 00 00 00
I just tested with an asm loop on SKL. An FMA instructions like vfmadd231pd ymm0, ymm1, ymm3 counts for 2 counts of fp_arith_inst_retired.256b_packed_double, even though it's a single uop!
I guess Intel really wanted a FLOP counter, not an instruction or uop counter.
Your 3rd 256-bit FP uop is probably coming from something else you're doing, like a horizontal sum that starts out doing a 256-bit shuffle and another 256-bit add, instead of reducing to 128-bit first. I hope you're not using _mm256_hadd_pd!
Test code inner loop:
$ asm-link -d -n "testloop.asm" # assemble with NASM -felf64 and link with ld into a static binary
mov ebp, 100000000 # setup stuff outside the loop
vzeroupper
0000000000401040 <_start.loop>:
401040: c4 e2 f5 b8 c3 vfmadd231pd ymm0,ymm1,ymm3
401045: c4 e2 f5 b8 e3 vfmadd231pd ymm4,ymm1,ymm3
40104a: ff cd dec ebp
40104c: 75 f2 jne 401040 <_start.loop>
$ taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,fp_arith_inst_retired.256b_packed_double -r4 ./"$t"
Performance counter stats for './testloop-cvtss2sd' (4 runs):
102.67 msec task-clock # 0.999 CPUs utilized ( +- 0.00% )
2 context-switches # 24.510 M/sec ( +- 20.00% )
0 cpu-migrations # 0.000 K/sec
2 page-faults # 22.059 M/sec ( +- 11.11% )
400,388,898 cycles # 3925381.355 GHz ( +- 0.00% )
100,050,708 branches # 980889291.667 M/sec ( +- 0.00% )
400,256,258 instructions # 1.00 insn per cycle ( +- 0.00% )
300,377,737 uops_issued.any # 2944879772.059 M/sec ( +- 0.00% )
300,389,230 uops_executed.thread # 2944992450.980 M/sec ( +- 0.00% )
400,000,000 fp_arith_inst_retired.256b_packed_double # 3921568627.451 M/sec
0.1028042 +- 0.0000170 seconds time elapsed ( +- 0.02% )
400M counts of fp_arith_inst_retired.256b_packed_double for 200M FMA instructions / 100M loop iterations.
(IDK what up with perf 4.20.g8fe28c + kernel 4.20.3-arch1-1-ARCH. They calculate per-second stuff with the decimal in the wrong place for the unit. e.g. 3925381.355 kHz is correct, not GHz. Not sure if it's a bug in perf or the kernel.
Without vzeroupper, I'd sometimes see a latency of 5 cycles, not 4, for FMA. IDK if the kernel left a register in a polluted state or something.
Why do I get three though, and not two? (see MCVE added to original post)
Your ddot4 runs _mm256_add_pd(v_dot0, v_dot1); at the start of the cleanup, and since you call it with size=4, you get the cleanup once per FMA.
Note that your v_dot1 is always zero (because you didn't actually unroll with 2 accumulators like you're planning to?) So this is pointless, but the CPU doesn't know that. My guess was wrong, it's not a 256-bit hadd, it's just a useless 256-bit vertical add.
(For larger vectors, yes multiple accumulators are very valuable to hide FMA latency. You'll want at least 8 vectors. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about unrolling with multiple accumulators. But then you'll want a cleanup loop that does 1 vector at a time until you're down to the last up-to-3 elements.)
Also, I think your final _mm_add_sd(u_dot0, u_tmp); is actually a bug: you've already added the last pair of elements with an inefficient 128-bit hadd, so this double-counts the lowest element.
See Get sum of values stored in __m256d with SSE/AVX for a way that doesn't suck.
Also note that GCC is splitting your unaligned loads into 128-bit halves with vinsertf128 because you compiled with the default -mtune=generic (which favours Sandybridge) instead of using -march=haswell to enable AVX+FMA and set -mtune=haswell. (Or use -march=native)

How to find F(x,0) when F(x,i) = F(x-1,i) xor F(x-1, i+1) xor ... F(x-1,n) in less than linear time

Given a base array, I need to compute the value of a function given below:
A[] = { a0, a1, a2, a3, .. an }
F(0,i) = ai [Base case]
F(1,i) = F(0,i) xor F(0,i+1) xor F(0,i+2) ... xor F(0,n)
F(2,i) = F(1,i) xor F(1,i+1) xor F(1,i+2) ... xor F(1,n)
.
.
F(x,i) = F(x-1,i) xor F(x-1,i+1) xor F(x-1,i+2) ... xor F(x-1,n)
0 < x < 10^18
0 < n < 10^5
I need to find F(x,0).
I am trying to solve this equation for the past 3 days. I have failed to optimise it and come up with a feasible solution. Any help to find F(x,0) in less than linear time is appreciated.
My observation (if it is of any importance) :
F(0,0) = a0
F(1,0) = a0^a1^a2^a3 ....
F(2,0) = a0^a2^a4 ....
F(3,0) = (a0^a4^a8...) ^ (a1^a5^a9...)
F(4,0) = a0^a4^a8 ....
F(5,0) = (a0^a1^a2^a3) ^ (a8^a9^a10^a11) ^ (a16^..a19) ^ ...
F(6,0) = (a0^a8^a16...) ^ (a2^a10^a18...)
F(7,0) = (a0^a8^a16...) ^ (a1^a9^a17...)
F(8,0) = a0^a8^a16^ ....
Maybe it becomes easier if you reverse the array, with the recurrence relation
F(0, i) = a[n-i]
F(x, i) = XOR[0 <= j < i]( F(x-1, j) )
that is, going from 0 to i instead of i to n. We are then looking for F(x, n). If we make a table:
| 0 1 2 3
--+-------------------------------------------------------------------------------------------
0 | A=a[n] B=a[n-1] C=a[n-2] D=a[n-3]
1 | A A^B A^B^C A^B^C^D
2 | A A^(A^B) A^(A^B)^(A^B^C) A^(A^B)^(A^B^C)^(A^B^C^D)^B^C^D
3 | A A^(A^(A^B)) A^(A^(A^B))^(A^(A^B)^(A^B^C)) A^(A^(A^B))^(A^(A^B)^(A^B^C))^(A^(A^B)^(A^B^C)^(A^B^C^D))
or let's stop using the XOR sign and just count how often each term appears:
| 0 1 2 3 4
--+-----------------------------------------------
0 | 1A 1B 1C 1D 1E
1 | 1A 1A1B 1A1B1C 1A 1B1C1D 1A 1B 1C1D1E
2 | 1A 2A1B 3A2B1C 4A 3B2C1D 5A 4B 3C2D1E
3 | 1A 3A1B 6A3B1C 10A 6B3C1D 15A10B 6C3D1E
4 | 1A 4A1B 10A4B1C 20A10B4C1D 35A20B10C4D1E
We can also see that the recurrence relation is the same as
F(x, i) = F(x-1, i) + F(x, i-1)
This is Pascal's Triangle. Specifically, the values we are after are the nth diagonal, which can of course be calculated on its own. Then check whether the number is even or odd, so that you know whether the XOR operations on that element cancel themselves out or not.

hex offset sector

I'm getting a response from a nameserver which is longer then 512 bytes. in that response are some offsets. an offset from the beginning of the response is going fine, but when i get above 512 bytes the offset changes and it doesn't work anymore.
c0 0c = byte 12 from the start(works like a charm)
i have an offset:c1 f0 which means(in my knowledge so far)
c1 = 1 x 512 = 512
f0 = 240
c1 f0= byte 240 from byte 512 == byte 752
my offset should point to the beginning of a name, which should be located at byte 752
but at byte 752 the name isn't located.
Question
how does the offset work after 512 bytes?
It is a relative reference. In order to indicate that it is a relative reference, the first 2 bits are "reserved". You can reference a maximum of 14 bits: 2 bytes with the highest 2 bits are reserved. C0 01 is the reference offset 1. It does therefore not always have to be C0. it can also be C1, C2, C3, C4, CF etc. In practice this will be fairly rare unless you have a very complex long running queries which is the case. I have a query of 3000+ bytes:)
C1 = 11000001
strip 2 highest bits : 000001
number = 1
offset of C1 F0 is 1 x 256 + 240 = 496
offset of C9 9F is 9 x 256 + 159 = 2463
in one byte there are 256 combinations, not 512 which is used :S
The max of C0 is C0 FF which is 255. after that C1 00 starts
Credits of this explanation go to http://www.helpmij.nl/forum/member.php/215405-wampier

Resources