OpenCL: How do I make use of INT4/8? - opencl
I am looking for ways to improve the efficiency of an algorithm based on OpenCL.
Currently I use the float and int datatypes on a Radeon VII card. However, a datatype covering numbers from -8 to +7 would be sufficient.
According to the following article, the Radeon VII achieves a peak performance of 53/110 TFlops when restricted to INT8/INT4, which is much higher than the 14 TFlops it reaches with float.
https://www.pcgameshardware.de/Radeon-VII-Grafikkarte-268194/Tests/Benchmark-Review-1274185/2/
So my question is: how can I make use of INT8/INT4 operations? Do I simply use the datatype char instead of int in OpenCL? And since char is the smallest built-in datatype, how can I even use INT4?
For "int8", that is, 8-bit integers, the OpenCL type is indeed char (signed, -128 to +127) or uchar (unsigned, 0 to 255). Not to be confused with the OpenCL type int8, which is a vector of 8 32-bit integers.
For decent performance you may want to use vector versions of these, such as char4 or char16, though this choice should be driven by your performance measurements, not guesswork.
Note that you'll need to be aware of overflow behaviour; especially for multiplications, you may need to perform intermediate operations on 16-bit values (short/ushort/short4/ushort16/etc.). OpenCL also provides "saturating" addition and subtraction and a few other helpful integer built-in functions.
I'm not aware of any "native" support for packed 4-bit integer maths in OpenCL or any of the other GPGPU frameworks, or even any extensions. Maybe someone with experience of this can chip in, but my guess is that you'll effectively need to unpack 4-bit nibbles into uchar values using bit shifts and masking, perform your operations on the uchar values, and then pack the results back into nibbles for storing. The speed boost will likely come from the fact that you can safely multiply using 8-bit logic rather than needing 16 bits to catch the overflow.
I ran a test with some kernels to see if there is any difference in performance between int8 and char8:
typedef int8 type_msg;
//typedef char8 type_msg;
#define convert_type_msg(x) convert_int8(x)

__kernel void some_operation(__global type_msg *in_buff,
                             __global type_msg *out_buff)
{
    out_buff[get_global_id(0)] = in_buff[get_global_id(0)] + (type_msg)(2);
}
First, to see what happens on the GPU, I used CodeXL to get the assembly code.
Here is part of the assembly where int8 is used:
global_load_dwordx4 v[4:7], v[2:3], off
global_load_dwordx4 v[8:11], v[2:3], off inst_offset:16
v_add_co_u32 v0, vcc, s6, v0
v_mov_b32 v2, s7
v_addc_co_u32 v1, vcc, v2, v1, vcc
s_waitcnt vmcnt(0)
v_add_u32 v8, 2, v8
v_add_u32 v9, 2, v9
v_add_u32 v10, 2, v10
v_add_u32 v11, 2, v11
global_store_dwordx4 v[0:1], v[8:11], off inst_offset
v_add_u32 v2, 2, v4
v_add_u32 v3, 2, v5
v_add_u32 v4, 2, v6
v_add_u32 v5, 2, v7
global_store_dwordx4 v[0:1], v[2:5], off
And here is part of the assembly where char8 is used:
global_load_dwordx2 v[2:3], v[2:3], off
s_waitcnt vmcnt(0)
v_lshlrev_b32 v4, 8, v3 src1_sel:BYTE_3
v_lshrrev_b32 v5, 8, v3
v_add_u32 v6, 2, v3 src1_sel:WORD_1
v_add_u32 v4, 0x00000200, v4
s_movk_i32 s0, 0x00ff
v_lshlrev_b32 v7, 8, v2 src1_sel:BYTE_3
v_add_u32 v5, 2, v5
v_bfi_b32 v4, s0, v6, v4
s_mov_b32 s1, 0x02010004
v_lshrrev_b32 v6, 8, v2
v_add_u32 v8, 2, v2 src1_sel:WORD_1
v_add_u32 v7, 0x00000200, v7
v_add_u32 v3, 2, v3
v_perm_b32 v4, v5, v4, s1
v_add_u32 v5, 2, v6
v_bfi_b32 v6, s0, v8, v7
v_add_co_u32 v0, vcc, s6, v0
v_mov_b32 v7, s7
v_addc_co_u32 v1, vcc, v7, v1, vcc
v_perm_b32 v3, v3, v4, s1
v_add_u32 v2, 2, v2
v_perm_b32 v4, v5, v6, s1
v_perm_b32 v2, v2, v4, s1
global_store_dword v[0:1], v3, off inst_offset:4
global_store_dword v[0:1], v2, off
I am no expert in assembly language, but as far as I can tell, both cases carry out 8 additions with the v_add_u32 instruction. The char8 version, however, seems to require additional instructions such as v_perm_b32 and v_bfi_b32. Maybe someone can explain what these are doing.
The only benefit of using char8 seems to be that less global memory access is needed: there is only one global_load_dwordx2 access for char8, but two global_load_dwordx4 accesses for int8.
Thus, in terms of performance, char8 may be a little slower for compute-bound algorithms but faster for memory-bound algorithms.
To verify the analysis, I built a little experiment where arithmetic is the bottleneck. To make sure the compiler does not simplify the for loop too much, I added some branching inside it.
typedef int8 type_msg;
#define convert_type_msg(x) convert_int8(x)
//typedef char8 type_msg;
//#define convert_type_msg(x) convert_char8(x)

__kernel void some_complex_operation(__global char8 *in_buff,
                                     __global char8 *out_buff)
{
    type_msg res = convert_type_msg(in_buff[get_global_id(0)]);
    for (int i = 0; i < 1000000; i++)
    {
        res += select((type_msg)(-1), (type_msg)(4), res < (type_msg)100);
    }
    out_buff[get_global_id(0)] = convert_char8(res);
}
On my system the average time (running 100 times) for
int8 is 0.0558 sec
char8 is 0.0754 sec
short8 is 0.0738 sec
long8 is 0.1105 sec
So char8 consumes roughly 35 % more time. This confirms the observation that more instructions are generated for char8 in assembly. However, a professional explanation of the additional assembly statements would still be welcome.
Related
Is there a way to map the output of the AES algorithm to the hex display on a Nexys A7?
I am implementing the AES algorithm on a Nexys A7 and I don't understand how to display the output. How can I display the first 4 bytes or the last 4 bytes of the output on the hex display? I've included the testbench and the AES module.

`include "round_last.v"

module AES(key_out, state_out, key_in, state_in);
    input [0:127] state_in, key_in;
    output [0:127] state_out, key_out;
    wire [0:127] key1, key2, key3, key4, key5, key6, key7, key8, key9;
    wire [0:127] s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s_out;
    assign s0 = state_in ^ key_in;
    //assign state_in = 128'h00_11_22_33_44_55_66_77_88_99_aa_bb_cc_dd_ee_ff;
    //assign key_in = 128'h00_01_02_03_04_05_06_07_08_09_0a_0b_0c_0d_0e_0f;
    round r1(key1, s1, 4'h1, s0, key_in); // round(rkey, state_out, rn, state, prkey);
    round r2(key2, s2, 4'h2, s1, key1);
    round r3(key3, s3, 4'h3, s2, key2);
    round r4(key4, s4, 4'h4, s3, key3);
    round r5(key5, s5, 4'h5, s4, key4);
    round r6(key6, s6, 4'h6, s5, key5);
    round r7(key7, s7, 4'h7, s6, key6);
    round r8(key8, s8, 4'h8, s7, key7);
    round r9(key9, s9, 4'h9, s8, key8);
    round_last r10(key_out, state_out, 4'hA, s9, key9);
endmodule

`timescale 1ns/1ps
`include "AES.v"

module AES_tb;
    reg [0:127] state, key;
    wire [0:127] s_out, k_out;
    always#(state);
    AES crypt(k_out, s_out, key, state);
    initial #50 ;
    initial begin
        $monitor("a = %h, b= %h, c= %h, d=%h", state, key, s_out, k_out);
        $dumpfile("AES_tb.vcd");
        $dumpvars(0, AES_tb);
        //#0 state = 128'h00_04_12_14_12_04_12_00_0C_00_13_11_08_23_19_19 ;
        //#0 key = 128'h0F_15_71_C9_47_D9_E8_59_0C_B7_AD_D6_AF_7F_67_98 ;
        //#0 state = 128'h54_77_6F_20_4F_6E_65_20_4E_69_6E_65_20_54_77_6F;
        //#0 key = 128'h54_68_61_74_73_20_6D_79_20_4B_75_6E_67_20_46_75;
        #0 state = 128'h00_11_22_33_44_55_66_77_88_99_aa_bb_cc_dd_ee_ff;
        #0 key = 128'h00_01_02_03_04_05_06_07_08_09_0a_0b_0c_0d_0e_0f;
        $finish;
    end
endmodule
Instant SoC has a 7-segment driver, i.e. you can do that with a couple of lines of C++. With the timer etc. it is very easy to scroll your result on the display. Use the class FC_IO_SegmentDisplay; there is an example of how to use it. I have used this on the Nexys A7 board and it is very easy to use.
OpenCL vstoren does not store vector in scalar array
I have the kernel below. My question is: why is vstore8 not working? When the output is printed in the host code, it only returns 0s. I put an if(all(v == 0) == 1) in the code to check whether the error occurs when I copy the values from the int4s into the int8 v, but it was not that. It seems like vstoren is doing nothing. I am new to OpenCL, so any help is appreciated.

__kernel void select_vec(__global int4 *input1, __global int *input2, __global int *output)
{
    // copy values in the input arrays to vectors
    int i = get_global_id(0);
    int4 vA = input1[i];
    int4 vB = input1[i+1];
    __private int8 v = (int8)(vA.s0, vA.s1, vA.s2, vA.s3, vB.s0, vB.s1, vB.s2, vB.s3);
    __private int8 v1 = vload8(0, input2);
    __private int8 v2 = vload8(1, input2);
    int8 results;
    if(any(v > 10) == 1){
        // if any of the elements in v are greater than 10:
        // copy the corresponding elements from v1 for elements greater than 10,
        // and from v2 for elements less than or equal to 10
        results = select(v1, v2, v > 10);
    } else {
        // results is the combination of the first halves of v1 and v2
        results = (int8)(v1.lo, v2.lo);
    }
    /* for testing whether the error is due to vstoren */
    // results = (int8)(1);

    // store results in the output array
    vstore8(results, i, output);
}
Do you mean int8 v1 = vload8(i+0, input2); and int8 v2 = vload8(i+1, input2);? Currently, all threads read from the same memory addresses in input2 (elements 0-7 for v1 and 8-15 for v2) and all write to the same memory addresses in output (elements 0-7). This is a race condition: depending on v and on which thread writes to output last, you can get randomly different results. But if input2 starts with 0s in elements 0-15 and output is initialized with all 0s, it will remain all 0s.
HERE API shows smaller trafficTime for a route with the same start and end points but via an additional waypoint
How come trafficTime (fastest) from A to B is longer than trafficTime from A to B via C, with the same arrival/departure times?

V7 requests:

A to B: trafficTime = 4979 seconds
https://route.ls.hereapi.com/routing/7.2/calculateroute.json?apiKey=KEY&waypoint0=32.6289624435649%2C35.079885159610136&waypoint1=32.0155%2C34.7505&mode=fastest%3Bcar&combineChange=true&language=he&instructionformat=text&departure=2020-05-21T10:00:00.000Z

A to B via C: trafficTime = 4936 seconds
https://route.ls.hereapi.com/routing/7.2/calculateroute.json?apiKey=KEY&waypoint0=32.6289624435649%2C35.079885159610136&waypoint1=32.119485%2C34.938341&waypoint2=32.0155%2C34.7505&mode=fastest%3Bcar&combineChange=true&language=he&instructionformat=text&departure=2020-05-21T10:00:00.000Z

The same happens in V8:

A to B: duration = 3987 seconds
https://router.hereapi.com/v8/routes?transportMode=car&return=travelSummary,summary,polyline,actions&origin=32.6289624435649,35.079885159610136&destination=32.0155,34.7505&apikey=KEY

A to B via C: duration = 3955 seconds
https://router.hereapi.com/v8/routes?transportMode=car&return=travelSummary,summary,polyline,actions&origin=32.6289624435649,35.079885159610136&destination=32.0155,34.7505&via=32.119485,34.938341!stopDuration=0&apikey=KEY

If I request several alternatives (available only without the "arrival" param), I get an even faster and shorter route:

V7 A to B with alternatives: fastest trafficTime = 4765
https://route.ls.hereapi.com/routing/7.2/calculateroute.json?apiKey=KEY&waypoint0=32.6289624435649%2C35.079885159610136&waypoint1=32.0155%2C34.7505&mode=fastest%3Bcar&combineChange=true&language=he&instructionformat=text&departure=2020-05-21T10:00:00.000Z&alternatives=9
Creating the same scenario at our end clearly shows that there is a significant route change when we specify the intermediate waypoint as given in the API, which reduces the travel time. Please check the attachment.
Knapsack algorithm: strange behavior with pown() on the GPU
The CPU OCL version produces correct results, whereas the GPU OCL version gives slightly different results in some places, which ultimately affects the correctness of the result. I have debugged on the Intel OCL SDK, where I get correct results. I haven't spotted any race condition or concurrent access to memory. The problem appeared after I introduced the pown function into the kernel (one line of code):

void kernel knapsack(global int *input_f, global int *output_f, global uint *m_d,
                     int cmax, int weightk, int pk, int maxelem, int i)
{
    int c = get_global_id(0) + cmax;
    if (get_global_id(0) < maxelem) {
        if (input_f[c] < input_f[c - weightk] + pk) {
            output_f[c] = input_f[c - weightk] + pk;
            m_d[c - 1] = pown(2.0, i); // previous version: m_d[c-1] = 1;
        } else {
            output_f[c] = input_f[c];
        }
    }
}

The purpose of pown is to compress the m_d buffer, which holds the outcomes. For example, the per-iteration rows

1 0 1 0
0 1 0 1
1 0 0 0

are compressed column-wise into 2^0+2^2, 2^1, 2^0, 2^1. On the GPU I get something like 2^0+2^2, 2^1, 2^0+2^2, 2^1 instead: in the 3rd column pown is applied one more time than it should be. This gives me that "slightly" different result. Here you can find the full code. This work is based on the article "Solving knapsack problems on GPU" by V. Boyer, D. El Baz, M. Elkihel; related work: "Accelerating the knapsack problem on GPUs" by Bharath Suri.
Referencing individual pins in hex, bit masking
I'm using a Netduino Plus 2 and need to understand how to convert an individual pin's number into hex for bit masking, e.g.:

Pseudocode:

if counter_value_bit_1 is 1, do:
    write 1 to D0 pin
else
    write 0 to D0 pin
.....
counting from bit_1 through bit_9:
if counter_value_bit_9 is 1, do:
    write 1 to D0 pin
else
    write 0 to D0 pin

Answer:

if (counter_value & 0x01) { // bit_1 ... }
if (counter_value & 0x200) { // bit_9 ... }

My question: how do you get 0x200 = bit 9, etc.? An example or two for bits between 1 and 9 would be great. Thanks!
Which language do you use? In C you can't handle "bit 9" of a multi-byte value in a portable way; see byte ordering. If you know the byte order, you can extract the byte that contains bit 9:

#define BYTE_WITH_BIT_9 ...

int counter_value = 42;
((char*)&counter_value)[BYTE_WITH_BIT_9]

Cast to char* to access the raw bytes, then choose the byte, and now you can do the usual bit operations.