SIMD instruction for updating value at address present in array

I need to know whether Intel Xeon Phi scatter/gather instructions can implement this use case:
int *ptr[3];
int a, b, c;
ptr[0] = &a;
ptr[1] = &b;
ptr[2] = &c;
I want to add 1 to a, b and c using a vector instruction and ptr.
I need a vector operation that works not on an immediate value but on the value stored at each address.

Related

Verilog 4-bit ripple adder which is made up of full adders

I simulated a 4-bit ripple adder made up of 4 full adders in Verilog. Here, I'm trying to understand what is happening with Cout. Cout stands for carry output. I can't explain how values E and F were obtained in Cout.
This is ripple_adder.v
module full_adder( A, B, CIN, Q, COUT );
input A, B, CIN;
output Q, COUT;
assign Q = A ^ B ^ CIN;
assign COUT = (A & B) | (B & CIN) | (CIN & A);
endmodule
module adder_ripple( a, b, q );
input [3:0] a, b;
output [3:0] q;
wire [3:0] cout;
full_adder add0 ( .Q(q[0]), .COUT(cout[0]),
.A(a[0]), .B(b[0]), .CIN( 1'b0) );
full_adder add1 ( .Q(q[1]), .COUT(cout[1]),
.A(a[1]), .B(b[1]), .CIN(cout[0]) );
full_adder add2 ( .Q(q[2]), .COUT(cout[2]),
.A(a[2]), .B(b[2]), .CIN(cout[1]) );
full_adder add3 ( .Q(q[3]), .COUT(cout[3]),
.A(a[3]), .B(b[3]), .CIN(cout[2]) );
endmodule
This is test bench for ripple_adder.v
`timescale 1ps/1ps
module adder_ripple_tp;
reg [3:0] a, b; // reg declaration for input
wire [3:0] q; // wire declaration for output
parameter STEP = 100000;
adder_ripple adder_ripple( a, b, q );
initial begin
$dumpfile("adder_ripple.vcd");
$dumpvars(0, adder_ripple_tp);
a = 4'h0; b = 4'h0;
#STEP a = 4'h5; b = 4'ha;
#STEP a = 4'h7; b = 4'ha;
#STEP a = 4'h1; b = 4'hf;
#STEP a = 4'hf; b = 4'hf;
#STEP $finish;
end
initial $monitor( $stime, " a=%h b=%h q=%h", a, b, q );
endmodule
The wave looks like this:
Can someone help me understand it?

cout[3] is the carry out of the most significant stage. It carries a weight of 2^4 = 16 when it is asserted and contributes 0 when it is de-asserted.
For the vector where a=7, b=0xa=10, the answer is 17, which is the sum of q=1 and the weight of cout[3], 16.
cout is equal to 0xe=4'b1110 in this vector, indicating that the sum of the least significant bits did not carry out, and the sum at each of the other bit positions did carry out.
For the vector where a=1, b=0xf=15, the answer is 16, which is the sum of q=0 and the weight of cout[3], 16.
cout is equal to 0xf=4'b1111 in this state, indicating that the sum at every bit position carried out.
For the vector where a=0xf=15, b=0xf=15, the answer is 30, which is the sum of q=0xe=14 and the weight of cout[3], 16.
cout is equal to 0xf=4'b1111 in this state, indicating that the sum at every bit position carried out.

How to bruteforce a lossy AND routine?

I'm wondering whether there are any standard approaches to reversing AND routines by brute force.
For example I have the following transformation:
MOV(eax, 0x5b3e0be0) # Here we move 0x5b3e0be0 to EAX.
MOV(edx, eax) # Here we copy 0x5b3e0be0 to EDX as well.
SHL(edx, 0x7) # Shift 0x5b3e0be0 left by 0x7, which results in 0x9f05f000
AND(edx, 0x9d2c5680) # AND 0x9f05f000 with 0x9d2c5680, which results in 0x9d045000
XOR(edx, eax) # XOR 0x9d045000 with the original value 0x5b3e0be0, which results in 0xc63a5be0
My question is how to brute force and reverse this routine (i.e. transform 0xc63a5be0 back into 0x5b3e0be0).
One idea I had (which didn't work) was this, using a PeachPy implementation:
#Input values
MOV(esi, 0xffffffff) # Initial value to AND with, which will be decreased by 1 in a loop.
MOV(cl, 0x1) # Initial value to SHR with, which will be increased by 1 until 0x1f.
MOV(eax, 0xc63a5be0) # Target result which I'm looking to get using the below loop.
MOV(edx, 0x5b3e0be0) # Input value which will be transformed.
sub_esi = peachpy.x86_64.Label()
with loop:
#End the loop if ESI = 0x0
TEST(esi, esi)
JZ(loop.end)
#Test the routine and check if it matches end result.
MOV(ebx, eax)
SHR(ebx, cl)
TEST(ebx, ebx)
JZ(sub_esi)
AND(ebx, esi)
XOR(ebx, eax)
CMP(ebx, edx)
JZ(loop.end)
#Add to the CL register which is used for SHR.
#Also check if we've reached the last potential value of CL which is 0x1f
ADD(cl, 0x1)
CMP(cl, 0x1f)
JNZ(loop.begin)
#Decrement ESI by 1, reset CL and restart routine.
peachpy.x86_64.LABEL(sub_esi)
SUB(esi, 0x1)
MOV(cl, 0x1)
JMP(loop.begin)
#The ESI result here will either be 0x0 or a valid value to AND with and get the necessary result.
RETURN(esi)
Maybe an article or a book you can recommend specific to this?

It's not lossy, the final operation is an XOR.
The whole routine can be modeled in C as
#define K 0x9d2c5680
uint32_t hash(uint32_t num)
{
return num ^ ( (num << 7) & K);
}
Now, if we have two bits x and y and the operation x XOR y, when y is zero the result is x.
So given two numbers n1 and n2 and considering their XOR, the bits of n1 that pair with a zero in n2 make it to the result unchanged (the others are flipped).
So in considering num ^ ( (num << 7) & K) we can identify num with n1 and (num << 7) & K with n2.
Since n2 is the result of an AND, it must have zero bits at least wherever K has them.
This means that each bit of num that corresponds to a zero bit in the constant K makes it unchanged into the result.
Thus, by extracting those bits from the result we already have a partial inverse function:
/*hash & ~K extracts the bits of hash that pair with a zero bit in K*/
partial_num = hash & ~K
Technically, the factor num << 7 would also introduce other zeros in the result of the AND. We know for sure that the lowest 7 bits must be zero.
However K already has the lowest 7 bits zero, so we cannot exploit this information.
So we will just use K here, but if its value were different you'd need to consider the AND (which, in practice, means to zero the lower 7 bits of K).
This leaves us with 13 bits unknown (the ones corresponding to the bits that are set in K).
If we forget about the AND for a moment, we would have x ^ (x << 7) meaning that
h_i = num_i for i from 0 to 6 inclusive
h_i = num_i ^ num_{i-7} for i from 7 to 31 inclusive
(The first line is due to the fact that the lower 7 bits of the right-hand operand are zero.)
From this, starting from h_7 and going up, we can retrieve num_7 as h_7 ^ num_0 = h_7 ^ h_0.
From bit 14 onward, the equality num_{i-7} = h_{i-7} no longer holds and we need to use num_{i-7} itself, but luckily we have already computed its value in a previous step (that's why we go from lower to higher bits).
What the AND does to this is just restricting the values the index i runs in, specifically only to the bits that are set in K.
So to fill in the thirteen remaining bits one has to do:
part_num_7 = h_7 ^ part_num_0
part_num_9 = h_9 ^ part_num_2
part_num_10 = h_10 ^ part_num_3
part_num_12 = h_12 ^ part_num_5
...
part_num_31 = h_31 ^ part_num_24
Note that we exploited the fact that part_num_0..6 = h_0..6.
Here's a C program that inverts the function:
#include <stdio.h>
#include <stdint.h>
#define BIT(i, hash, result) ( (((result >> i) ^ (hash >> (i+7))) & 0x1) << (i+7) )
#define K 0x9d2c5680
uint32_t base_candidate(uint32_t hash)
{
uint32_t result = hash & ~K;
result |= BIT(0, hash, result);
result |= BIT(2, hash, result);
result |= BIT(3, hash, result);
result |= BIT(5, hash, result);
result |= BIT(7, hash, result);
result |= BIT(11, hash, result);
result |= BIT(12, hash, result);
result |= BIT(14, hash, result);
result |= BIT(17, hash, result);
result |= BIT(19, hash, result);
result |= BIT(20, hash, result);
result |= BIT(21, hash, result);
result |= BIT(24, hash, result);
return result;
}
uint32_t hash(uint32_t num)
{
return num ^ ( (num << 7) & K);
}
int main()
{
uint32_t tester = 0x5b3e0be0;
uint32_t candidate = base_candidate(hash(tester));
printf("candidate: %x, tester %x\n", candidate, tester);
return 0;
}

Since the original question was how to "bruteforce" rather than solve, here's something I eventually came up with that works just as well. Obviously it's prone to errors depending on the input (there might be more than one result).
from peachpy import *
from peachpy.x86_64 import *

input = 0xc63a5be0

x = Argument(uint32_t)
with Function("DotProduct", (x,), uint32_t) as asm_function:
    LOAD.ARGUMENT(edx, x)  # EDX = target value to match
    MOV(esi, 0xffffffff)   # ESI = candidate, counted down from 0xffffffff
    with Loop() as loop:
        TEST(esi, esi)
        JZ(loop.end)
        # Apply the forward routine to the candidate.
        MOV(eax, esi)
        SHL(eax, 0x7)
        AND(eax, 0x9d2c5680)
        XOR(eax, esi)
        # Stop as soon as the candidate reproduces the target.
        CMP(eax, edx)
        JZ(loop.end)
        SUB(esi, 0x1)
        JMP(loop.begin)
    RETURN(esi)
#Read Assembler Return
abi = peachpy.x86_64.abi.detect()
encoded_function = asm_function.finalize(abi).encode()
python_function = encoded_function.load()
print(hex(python_function(input)))

behaviour of atomic_add in opencl

I'm playing around with an example on opencl:
__kernel void atomic(__global int* x) {
__local int a, b;
a = 0; b = 0;
a++;
atomic_inc(&b);
x[0] = a;
x[1] = b;
x[2]++;
atomic_inc(x+3);
}
Running this code with global_size = 1024 and workgroup_size = 8, this is the following output:
[1 8 1 1024]
I can understand what is happening for all cases except the value given for x[1]. Why is the value of x[1] not 1024 but 8?

x[1] stores the value of b, a variable that resides in the __local address space, meaning it is shared by all work-items within a work-group. Each work-group has its own b, initialized to 0 and atomically incremented to 8 because the work-group size is 8 (each work-item increments it by 1).

How is uint overflow defined in OpenCL?

What happens when the result of a multiplication or sum in OpenCL overflows? Does it wrap?
In particular I'd like to know if I can catch an overflow in
uint4 x = ( get_global_id( 0 ) * 4 + (uint4)(0, 1, 2, 3) ) * q + r;
with
int4 invalid = x < get_global_id( 0 ) * 4;
or how else that would be possible. (Assuming r >= 0 && q > r && q < (1 << 20) and the id will be at most just big enough to cause an overflow.)
Context: I want to check every 32 bit uint x for which x % q == r , where q and r are known. With vectors I can check 4 at a time, but the number of tests may not be divisible by 4.
I'm targeting the GPU, but that shouldn't be relevant, right?

OpenCL 1.2 standard (section 6.2.3.3) refers to C99 standard (section 6.3.1.3):
...if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.
Generally, get_global_id returns size_t, so a narrowing conversion is a bad idea IMO. That said, I have never faced an NDRange big enough to exceed the uint range.

create a random sequence, skip to any part of the sequence

On Linux there is an srand() function: you supply a seed and it guarantees the same sequence of pseudorandom numbers from subsequent calls to the random() function.
Let's say I want to store this pseudorandom sequence by remembering the seed value.
Furthermore, let's say I want the 100 thousandth number in this pseudorandom sequence later.
One way would be to supply the seed using srand(), call random() 100 thousand times, and remember the last number.
Is there a better way of skipping the other 99,999 numbers in the pseudorandom list and getting the 100 thousandth number directly?
thanks,
m

I'm not sure there's a defined standard for implementing rand on any platform; however, picking this one from the GNU Scientific Library:
— Generator: gsl_rng_rand
This is the BSD rand generator. Its sequence is
x_{n+1} = (a x_n + c) mod m
with a = 1103515245, c = 12345 and m = 2^31. The seed specifies the initial value, x_1. The period of this generator is 2^31, and it uses 1 word of storage per generator.
So to "know" x_n requires you to know x_{n-1}. Unless there's some obvious pattern I'm missing, you can't jump to a value without computing all the values before it. (But that's not necessarily the case for every rand implementation.)
If we start with x1...
x2 = (a * x1 + c) % m
x3 = (a * ((a * x1 + c) % m) + c) % m
x4 = (a * ((a * ((a * x1 + c) % m) + c) % m) + c) % m
x5 = (a * ((a * ((a * ((a * x1 + c) % m) + c) % m) + c) % m) + c) % m
It gets out of hand pretty quickly. Is that function easily reducible? I don't think it is.
(There's a statistics phrase for a series where xn depends on xn-1 -- can anyone remind me what that word is?)

If they're available on your system, you can use rand_r instead of rand & srand, or use initstate and setstate with random. rand_r takes an unsigned * as an argument, where it stores its state. After calling rand_r numerous times, save the value of this unsigned integer and use it as the starting value the next time.
For random(), use initstate rather than srandom. Save the contents of the state buffer for any state that you want to restore. To restore a state, fill a buffer with the saved contents and call setstate. If that buffer is already the current state buffer, you can skip the call to setstate.

This is developed from @Mark's answer using the BSD rand() function.
rand1() computes the nth random number, starting at seed, by stepping through n times.
rand2() computes the same using a shortcut. It can step up to 2^24-1 steps in one go. Internally it requires only 24 steps.
If the BSD random number generator is good enough for you then this will suffice:
#include <stdio.h>

/* Mask that reduces a value modulo 2^31. */
const unsigned int m = (1U << 31) - 1;

/* a[i] and b[i] describe 2^i steps at once: x -> a[i]*x + b[i] (mod 2^31). */
unsigned int a[24] = {
    1103515245, 1117952617, 1845919505, 1339940641, 1601471041,
    187569281 , 1979738369, 387043841 , 1046979585, 1574914049,
    1073647617, 285024257 , 1710899201, 1542750209, 2011758593,
    1876033537, 1604583425, 1061683201, 2123366401, 2099249153,
    2051014657, 1954545665, 1761607681, 1375731713
};
unsigned int b[24] = {
    12345, 1406932606, 1449466924, 1293799192, 1695770928, 1680572000,
    422948032, 910563712, 519516928, 530212352, 98880512, 646551552,
    940781568, 472276992, 1749860352, 278495232, 556990464, 1113980928,
    80478208, 160956416, 321912832, 643825664, 1287651328, 427819008
};

/* Step through the generator n times, one step at a time. */
unsigned int rand1(unsigned int seed, unsigned int n)
{
    unsigned int i;
    for (i = 0; i < n; ++i)
    {
        seed = (1103515245U * seed + 12345U) & m;
    }
    return seed;
}

/* Same result via the tables: one table step per set bit of n (n < 2^24). */
unsigned int rand2(unsigned int seed, unsigned int n)
{
    unsigned int i;
    for (i = 0; i < 24; ++i)
    {
        if (n & (1U << i))
        {
            seed = (a[i] * seed + b[i]) & m;
        }
    }
    return seed;
}

int main(void)
{
    printf("%u\n", rand1(10101, 100000));
    printf("%u\n", rand2(10101, 100000));
    return 0;
}
It's not hard to adapt to any linear congruential generator. I computed the tables in a language with a proper integer type (Haskell), but I could have computed them another way in C using only a few lines more code.

If you always want the 100,000th item, just store it for later.
Or you could generate the sequence once and store it, then query for a particular element by index later.
