Intrinsic optimization of SM4 encryption using AVX2

The performance of SM4-CBC encryption is limited by the feedback dependency between blocks. How can I improve the encryption speed using AVX2 intrinsics or any other method?
Below is a C snippet of my implementation; I need to make it about 3x faster.
// rk and the Arr* lookup tables are all uint32_t.
uint32_t T(uint32_t a1, uint32_t a2, uint32_t a3, uint32_t rk)
{
    uint32_t ka = a1 ^ a2 ^ a3 ^ rk;
    // Split ka into its four bytes, least significant byte first
    uint32_t ka1 = (unsigned char)(ka),
             ka2 = (unsigned char)(ka>>8),
             ka3 = (unsigned char)(ka>>16),
             ka4 = (unsigned char)(ka>>24);
    return Arr1[ka1] ^ Arr2[ka2] ^ Arr3[ka3] ^ Arr4[ka4];
}
void one_round(uint32_t input[4])
{
    uint32_t a0, a1, a2, a3;
    a0 = input[0];
    a1 = input[1];
    a2 = input[2];
    a3 = input[3];
    // This block will repeat many times using different rk
    a0 ^= T(a1, a2, a3, rk[0]);
    a1 ^= T(a2, a3, a0, rk[1]);
    a2 ^= T(a3, a0, a1, rk[2]);
    a3 ^= T(a0, a1, a2, rk[3]);
    // Write the updated state back so the caller sees the result
    input[0] = a0;
    input[1] = a1;
    input[2] = a2;
    input[3] = a3;
}
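The feedback dependency mentioned in the question can be made concrete with a small sketch. In CBC mode each ciphertext block is C[i] = E(P[i] ^ C[i-1]), so block i cannot start before block i-1 has finished, and SIMD lanes cannot be filled with consecutive blocks of one message. One commonly used workaround, assuming several independent messages are available at once, is to step through them side by side so that each SIMD lane could hold one message. The code below only illustrates the loop structure: toy_encrypt_block is a hypothetical stand-in, not SM4, and all function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real SM4 block function (this is NOT SM4). */
static void toy_encrypt_block(uint32_t blk[4], const uint32_t rk[4])
{
    for (int i = 0; i < 4; i++) {
        uint32_t x = blk[i] ^ rk[i];
        blk[i] = (x << 1) | (x >> 31);   /* rotate left by one bit */
    }
}

/* Plain CBC over one message: C[i] = E(P[i] ^ C[i-1]), with C[-1] = IV.
   Every iteration depends on the output of the previous one. */
void cbc_encrypt(uint32_t *data, size_t nblocks,
                 const uint32_t iv[4], const uint32_t rk[4])
{
    const uint32_t *prev = iv;
    for (size_t b = 0; b < nblocks; b++) {
        uint32_t *blk = data + 4 * b;
        for (int i = 0; i < 4; i++)
            blk[i] ^= prev[i];
        toy_encrypt_block(blk, rk);
        prev = blk;
    }
}

/* Same result for each message, but the inner loop runs over independent
   messages, so its iterations do not depend on each other.
   ivs holds 4 words per stream, stored back to back. */
void cbc_encrypt_interleaved(uint32_t **data, size_t nstreams, size_t nblocks,
                             const uint32_t *ivs, const uint32_t rk[4])
{
    for (size_t b = 0; b < nblocks; b++) {
        for (size_t s = 0; s < nstreams; s++) {
            uint32_t *blk = data[s] + 4 * b;
            const uint32_t *prev = (b == 0) ? ivs + 4 * s : blk - 4;
            for (int i = 0; i < 4; i++)
                blk[i] ^= prev[i];
            toy_encrypt_block(blk, rk);
        }
    }
}
```

With eight streams, the inner loop over s is the part that could map onto the eight 32-bit lanes of an AVX2 register, although the table lookups in T would then need gathers or a different S-box strategy. Whether that reaches the 3x target is not something this sketch can show.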

Related

Barometer (BMP180) pressure measurement problem

I have a BMP180 barometer interfaced with an STM32. I need to read the altitude with a precision of one foot.
The temperature reading is consistent and realistic, but the pressure comes out at around 59400 Pa (Pa as per the Bosch datasheet). The example given in the datasheet also has a pressure of 69964 Pa.
I have taken care of the delays the barometer needs to sample the data by using software-based counters. I also checked my sensor with available libraries, and everything is consistent. For academic reasons I can't use the libraries, so I need help measuring the pressure.
Here is my STM32 Code:
//calibration params from the sensor
static int16_t AC1 = 8353;
static int16_t AC2 = -1041;
static int16_t AC3 = -14621;
static uint16_t AC4 = 33880;
static uint16_t AC5 = 25158;
static uint16_t AC6 = 18608;
static int16_t B1 = 6515;
static int16_t B2 = 35;
static int16_t MB = -32768;
static int16_t MC = -11786;
static int16_t MD = 2724;
static int TMPCNT = 0, PRSCNT=0; //loop counters
static double temperature, pressure;
static uint8_t oss = 3; // bmp high resolution
static long UT, X1, X2, B5, UP;
int main() {
while (1) {
if (TMPCNT == 2) { //2.7*2 > 5ms so sample temp every 5 ms
MSR_TEMP(); //measure for previous request
RQST_TEMP(); // request again
TMPCNT = 0;
}
TMPCNT++;
if (PRSCNT == 10) { //sample pressure every 27ms (25ms wait time for UH resolution)
MSR_PRESS();
RQST_PRESS();
PRSCNT = 0;
}
PRSCNT++;
/* some more code which takes 2.7 milliseconds */
}
}
void RQST_TEMP() {
i2c_transmit(BMP, 0xF4, 0x2E); // transmit 0x2E to 0xF4 reg of BMP addr
}
void MSR_TEMP() {
char buff[2];
requestData(BMP, 0xF6, buff, 2); //request 2 bytes from BMP addr, reg 0xF6, and put it in the buff
UT = (buff[0] << 8 | buff[1]);
X1 = (long)((UT - AC6) * AC5 * pow(2, -15));
X2 = (long)(MC * pow(2, 11) / (X1 + MD));
B5 = X1 + X2;
temperature = ((B5 + 8) * pow(2, -4)) / 10.0;
}
void RQST_PRESS(){
i2c_transmit(BMP, 0xF4, (char)(0x34 + (oss << 6)));
}
void MSR_PRESS() {
long B6,x1,x2,x3,B3,p;
unsigned long B4,B7;
char buff[3];
requestData(BMP, 0xF6, buff, 3);
UP = ((buff[0] << 16) + (buff[1] << 8) + buff[2]) >> (8-oss);
B6 = B5 - 4000;
x1 = (long)((B2 * (B6 * B6 * pow(2, -12))) * pow(2, -11));
x2 = (long)(AC2 * B6 * pow(2, -11));
x3 = x1 + x2;
B3 = ((((long)AC1 * 4 + x3) << oss) + 2) / 4;
x1 = AC3 * B6 * pow(2, -13);
x2 = (B1 * (B6 * B6 * pow(2, -12))) * pow(2, -16);
x3 = ((x1 + x2) + 2) * pow(2, -2);
B4 = AC4 * (unsigned long)(x3 + 32768) * pow(2, -15);
B7 = ((unsigned long)UP - B3) * (50000 >> oss);
if (B7 < 0x80000000) {
p = (B7 * 2) / B4;
} else {
p = (B7 / B4) * 2;
}
x1 = (p * pow(2, -8)) * (p * pow(2, -8));
x1 = (x1 * 3038) * pow(2, -16);
x2 = (-7357 * p) * pow(2, -16);
pressure = p + ((x1 + x2 + 3791) * pow(2, -4)); // pressure is in Pa as per data sheet
// SerialPrint("%lf\n",pressure);
}
The code is based on the algorithm given in the datasheet. Here is the datasheet. I don't know what I am doing wrong. My location should have a pressure of 101209 Pa, which is far from what I am getting. My I2C driver functions also work as they should, so I can't figure out where exactly the problem is. I would appreciate some help.
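For comparison, here is a minimal sketch of the datasheet compensation written with integer arithmetic only, instead of pow() and doubles. It is not a diagnosis of the code above. The struct and function names are made up for this example, and up is assumed to be already shifted right by (8 - oss). Fed with the datasheet's example values (UT = 27898, UP = 23843, oss = 0), it should give 15.0 °C and a pressure within a pascal or so of the 69964 Pa quoted above; small differences can come from C division truncating toward zero.

```c
#include <stdint.h>

/* Calibration coefficients as read from the sensor EEPROM (names are illustrative). */
typedef struct {
    int16_t  ac1, ac2, ac3;
    uint16_t ac4, ac5, ac6;
    int16_t  b1, b2, mb, mc, md;
} bmp180_calib;

/* Returns B5, the intermediate value shared by both calculations.
   Temperature in units of 0.1 degC is (B5 + 8) / 16. */
int32_t bmp180_b5(const bmp180_calib *c, int32_t ut)
{
    int32_t x1 = ((ut - (int32_t)c->ac6) * (int32_t)c->ac5) / 32768;
    int32_t x2 = ((int32_t)c->mc * 2048) / (x1 + c->md);
    return x1 + x2;
}

/* Returns the compensated pressure in Pa.
   up must already be shifted right by (8 - oss). */
int32_t bmp180_pressure(const bmp180_calib *c, int32_t b5, int32_t up, int oss)
{
    int32_t b6 = b5 - 4000;
    int32_t x1 = (c->b2 * ((b6 * b6) / 4096)) / 2048;
    int32_t x2 = (c->ac2 * b6) / 2048;
    int32_t x3 = x1 + x2;
    /* Multiply instead of shifting, because the operand may be negative. */
    int32_t b3 = (((int32_t)c->ac1 * 4 + x3) * (1 << oss) + 2) / 4;

    x1 = (c->ac3 * b6) / 8192;
    x2 = (c->b1 * ((b6 * b6) / 4096)) / 65536;
    x3 = (x1 + x2 + 2) / 4;
    uint32_t b4 = ((uint32_t)c->ac4 * (uint32_t)(x3 + 32768)) / 32768;
    uint32_t b7 = ((uint32_t)up - (uint32_t)b3) * (uint32_t)(50000 >> oss);

    int32_t p = (b7 < 0x80000000u) ? (int32_t)((b7 * 2) / b4)
                                   : (int32_t)((b7 / b4) * 2);
    x1 = (p / 256) * (p / 256);
    x1 = (x1 * 3038) / 65536;
    x2 = (-7357 * p) / 65536;
    return p + (x1 + x2 + 3791) / 16;
}
```

Running the same function with your own calibration constants and raw UT/UP values makes it easy to compare each intermediate value against the floating-point version.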

How to get the last octet of an IP address into 3 separate ints?

I need to get the last octet of an IP address into 3 separate ints to control a MAX7219.
I know the octet is a uint8_t.
Using IPAddress ip = WiFi.localIP();, say my ip[3] was 148, I need:
int b1 = 1
int b2 = 4
int b3 = 8
But if ip[3] was only 8, then b1 and b2 have to be 0.
First thing that came to mind; there are probably more elegant ways to do this, but it works:
b3 = ip[3] % 10;
b2 = ((ip[3] % 100) - b3) / 10;
b1 = (ip[3] - (10 * b2) - b3) / 100;
I don't know why you would need separate integers, though.
Final function; note that no_green is the binary array holding the LED bits:
void showIP()
{
IPAddress ip = WiFi.localIP();
b3 = ip[3] % 10;
b2 = ((ip[3] % 100) - b3) / 10;
b1 = (ip[3] - (10 * b2) - b3) / 100;
// write bits to MAX7219
lc.setRow( 0, 6, byte( no_green[ b1] )); // hr leds position
lc.setRow( 0, 3, byte( no_green[ b2] )); // min leds position
lc.setRow( 0, 0, byte( no_green[ b3] )); // sec leds position
delay(6000);
}
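The digit extraction used above can also be written with plain integer division and modulo, which avoids the subtractions. This is a small standalone sketch in C; the function name split_digits is made up for this example.

```c
#include <stdint.h>

/* Split a value in the range 0..255 into its decimal digits.
   Missing leading digits come out as 0, e.g. 8 -> 0, 0, 8. */
void split_digits(uint8_t value, int *hundreds, int *tens, int *ones)
{
    *hundreds = value / 100;        /* 148 -> 1 */
    *tens     = (value / 10) % 10;  /* 148 -> 4 */
    *ones     = value % 10;         /* 148 -> 8 */
}
```

In the sketch above it would be called as split_digits(ip[3], &b1, &b2, &b3), assuming b1, b2 and b3 are declared as int.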

Arduino pro mini A6/A7 as output

I'm trying to use pins A6 and A7 as digital output pins, but there is 1.5 V (instead of 3 V) on these pins when the output is set high.
fp[0] = A1;
fp[1] = A2;
fp[2] = A3;
fp[3] = A6;
fp[4] = A7;
fp[5] = 6;
fp[6] = 7;
fp[7] = 8;
fp[8] = 9;
fp[9] = 10;
for (int i = 0; i < fp_size; i++) {
pinMode(fp[i], OUTPUT);
digitalWrite(fp[i], HIGH);
}
If you read the datasheet for the ATmega328P, it indicates that A6 and A7 are analog-input-only pins. You can't use them as general-purpose digital pins.

Does anyone know what the digital equivalent of analog pins 0-15 on a Mega 2560 is?

I am an Arduino beginner, and I am having trouble declaring analog pins. Can I use the int values 97 to 82 instead of A0 to A15? If yes, then how?
Can I use the int values 97 to 82 instead of A0 to A15?
No, that is not the correct range. From https://github.com/arduino/Arduino/blob/1.8.4/hardware/arduino/avr/variants/mega/pins_arduino.h#L51-L83:
#define PIN_A0 (54)
#define PIN_A1 (55)
#define PIN_A2 (56)
#define PIN_A3 (57)
#define PIN_A4 (58)
#define PIN_A5 (59)
#define PIN_A6 (60)
#define PIN_A7 (61)
#define PIN_A8 (62)
#define PIN_A9 (63)
#define PIN_A10 (64)
#define PIN_A11 (65)
#define PIN_A12 (66)
#define PIN_A13 (67)
#define PIN_A14 (68)
#define PIN_A15 (69)
static const uint8_t A0 = PIN_A0;
static const uint8_t A1 = PIN_A1;
static const uint8_t A2 = PIN_A2;
static const uint8_t A3 = PIN_A3;
static const uint8_t A4 = PIN_A4;
static const uint8_t A5 = PIN_A5;
static const uint8_t A6 = PIN_A6;
static const uint8_t A7 = PIN_A7;
static const uint8_t A8 = PIN_A8;
static const uint8_t A9 = PIN_A9;
static const uint8_t A10 = PIN_A10;
static const uint8_t A11 = PIN_A11;
static const uint8_t A12 = PIN_A12;
static const uint8_t A13 = PIN_A13;
static const uint8_t A14 = PIN_A14;
static const uint8_t A15 = PIN_A15;
So as you can see, the values of A0 through A15 are 54 through 69.
Can you use those numbers instead of the An pin names? Yes.
You can also use the analog channel number with analogRead().
For example the following three statements all read pin A0:
analogRead(A0)
and:
analogRead(54)
and:
analogRead(0)
But really it's less confusing just to use the pin names that are written on the silkscreen of your Arduino board.
If yes then how?
Exactly as you would with the An pin names.
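The mapping shown in the header excerpt is a fixed offset: analog channel n on the Mega 2560 corresponds to digital pin number 54 + n. A minimal sketch of that relationship in plain C follows; the macro and function names are made up for this example and are not part of the Arduino core.

```c
/* Values taken from the pins_arduino.h excerpt above (A0 = 54 ... A15 = 69). */
#define MEGA_FIRST_ANALOG_PIN 54
#define MEGA_NUM_ANALOG_PINS  16

/* Convert an analog channel number (0..15) to the matching pin number.
   Returns -1 if the channel does not exist on the Mega 2560. */
int mega_analog_channel_to_pin(int channel)
{
    if (channel < 0 || channel >= MEGA_NUM_ANALOG_PINS)
        return -1;
    return MEGA_FIRST_ANALOG_PIN + channel;
}
```

In a real sketch the An names already do this for you, so a helper like this is mainly useful when the channel number comes from a loop or from configuration data.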

Sum two entire vectors with CUDA

I'm trying to learn some CUDA and I can't figure out how to solve the following situation:
Consider two groups G1 and G2:
G1 has 2 vectors with 3 elements each: a1 = {2,5,8} and b1 = {8,4,6}
G2 has 2 vectors with 3 elements each: a2 = {7,3,1} and b2 = {4,2,9}
The task is to sum vectors a and b from each group and return a sorted vector c, so:
G1 will give c1 = {10,9,14} => (sort algorithm) => c1 = {9,10,14}
G2 will give c2 = {11,5,10} => (sort algorithm) => c2 = {5,10,11}
If I have a GeForce with 92 CUDA cores, I would like to create 92 groups G and compute all the sums in parallel, so:
core 1-> G1 -> c1 = a1 + b1 -> sort c1 -> return c1
core 2-> G2 -> c2 = a2 + b2 -> sort c2 -> return c2
....
core 92-> G92 -> c92 = a92 + b92 -> sort c92 -> return c92
The kernel below sums two vectors in parallel and returns another one:
__global__ void add( int*a, int*b, int*c )
{
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
What I can't understand is how to make the kernel handle the entire vector, not only one element of it, and then return an entire vector.
Something like this:
__global__ void add( int*a, int*b, int*c, int size )
{
for (int i = 0; i < size ; i++)
c[i] = a[i] + b[i];
//sort c
}
Can anyone please explain to me whether this is possible and how to do it?
This is a small example. It uses cudaMallocPitch and cudaMemcpy2D. I hope it will give you guidelines to solve your particular problem:
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#define N 92 // number of vector pairs, one thread each
#define M 3 // elements per vector
// Each thread adds one row of d_a and d_b into the matching row of d_c.
__global__ void test_access(float* d_a, float* d_b, float* d_c, size_t pitch1, size_t pitch2, size_t pitch3)
{
    int idx = threadIdx.x;
    // Rows in pitched memory are pitch bytes apart, so offset in bytes before casting back
    float* row_a = (float*)((char*)d_a + idx*pitch1);
    float* row_b = (float*)((char*)d_b + idx*pitch2);
    float* row_c = (float*)((char*)d_c + idx*pitch3);
    for (int i=0; i<M; i++) row_c[i] = row_a[i] + row_b[i];
    printf("row %i column 0 value %f \n",idx,row_c[0]);
    printf("row %i column 1 value %f \n",idx,row_c[1]);
    printf("row %i column 2 value %f \n",idx,row_c[2]);
}
/********/
/* MAIN */
/********/
int main()
{
    float a[N][M], b[N][M], c[N][M];
    // Pitched allocations are plain linear buffers, so single pointers are enough
    float *d_a, *d_b, *d_c;
    size_t pitch1,pitch2,pitch3;
    cudaMallocPitch(&d_a,&pitch1,M*sizeof(float),N);
    cudaMallocPitch(&d_b,&pitch2,M*sizeof(float),N);
    cudaMallocPitch(&d_c,&pitch3,M*sizeof(float),N);
    for (int i=0; i<N; i++)
        for (int j=0; j<M; j++) {
            a[i][j] = i*j;
            b[i][j] = -i*j+1;
        }
    cudaMemcpy2D(d_a,pitch1,a,M*sizeof(float),M*sizeof(float),N,cudaMemcpyHostToDevice);
    cudaMemcpy2D(d_b,pitch2,b,M*sizeof(float),M*sizeof(float),N,cudaMemcpyHostToDevice);
    test_access<<<1,N>>>(d_a,d_b,d_c,pitch1,pitch2,pitch3);
    // This copy also waits for the kernel to finish
    cudaMemcpy2D(c,M*sizeof(float),d_c,pitch3,M*sizeof(float),N,cudaMemcpyDeviceToHost);
    for (int i=0; i<N; i++)
        for (int j=0; j<M; j++) printf("row %i column %i value %f\n",i,j,c[i][j]);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
92 3-D vectors can be seen as one 276-D vector, so you can use the single vector-add kernel to add them. Thrust would be a simpler way to do this.
Update:
If your vectors are only 3-D, you could simply sort the elements sequentially, immediately after they are calculated.
If your vectors have higher dimensions, you could consider using cub::BlockRadixSort. The idea is to first add one vector per thread/block, then sort the vector within the block using cub::BlockRadixSort.
http://nvlabs.github.io/cub/classcub_1_1_block_radix_sort.html
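For the 3-element case, the "add, then sort sequentially" idea can be sketched on the host in plain C. This is only a reference for the work each thread would do on its own group, not a CUDA kernel, and the function name add_and_sort is made up for this example. An insertion sort is assumed here because the vectors are tiny.

```c
#include <stddef.h>

/* Compute c = a + b element by element, then sort c in ascending order.
   Insertion sort is adequate for very short vectors such as n = 3. */
void add_and_sort(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    for (size_t i = 1; i < n; i++) {
        int key = c[i];
        size_t j = i;
        /* Shift larger elements one slot to the right. */
        while (j > 0 && c[j - 1] > key) {
            c[j] = c[j - 1];
            j--;
        }
        c[j] = key;
    }
}
```

With the vectors from the question, G1 yields {9,10,14} and G2 yields {5,10,11}. Inside a kernel, the same body could run once per thread on that thread's row of a, b and c.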
