On the usage of clCreateImage2D - opencl

Usually the second parameter of clCreateImage2D is a CL_MEM_* flag (CL_MEM_READ_ONLY etc.). But I found it set to 0 in one of the sample codes (p. 80, Heterogeneous Computing using OpenCL):
//Create space for the source image on the device
cl_mem bufferSourceImage = clCreateImage2D(
context,0,&format, width,height,0,NULL,NULL);
Why is that?

cl_mem_flags are bitfields:
cl.h
/* cl_mem_flags - bitfield */
#define CL_MEM_READ_WRITE (1 << 0)
#define CL_MEM_WRITE_ONLY (1 << 1)
#define CL_MEM_READ_ONLY (1 << 2)
#define CL_MEM_USE_HOST_PTR (1 << 3)
#define CL_MEM_ALLOC_HOST_PTR (1 << 4)
#define CL_MEM_COPY_HOST_PTR (1 << 5)
// reserved (1 << 6)
#define CL_MEM_HOST_WRITE_ONLY (1 << 7)
#define CL_MEM_HOST_READ_ONLY (1 << 8)
#define CL_MEM_HOST_NO_ACCESS (1 << 9)
Here, passing 0 selects the default, which is CL_MEM_READ_WRITE:
A bit-field that is used to specify allocation and usage information
such as the memory arena that should be used to allocate the buffer
object and how it will be used. The following table describes the
possible values for flags. If value specified for flags is 0, the
default is used which is CL_MEM_READ_WRITE.
From: clCreateBuffer
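As a rough illustration of the rule quoted above, the sketch below models how a flags value with no access bit set falls back to read-write. The constants are copied from the cl.h excerpt under local names so the snippet compiles without the OpenCL headers, and `effective_access` is a hypothetical helper made up for this example, not an OpenCL API call.

```c
#include <stdint.h>

/* Values copied from the cl.h excerpt above, renamed so this compiles
   without the OpenCL headers (assumption: illustration only). */
#define MEM_READ_WRITE (1u << 0)
#define MEM_WRITE_ONLY (1u << 1)
#define MEM_READ_ONLY  (1u << 2)

/* Hypothetical helper: return the access mode a flags value asks for,
   treating "no access bit set" (for example flags == 0) as read-write. */
uint64_t effective_access(uint64_t flags)
{
    uint64_t access = flags & (MEM_READ_WRITE | MEM_WRITE_ONLY | MEM_READ_ONLY);
    return access ? access : MEM_READ_WRITE;
}
```

With this model, `effective_access(0)` yields the read-write bit, which matches what the sample code from the book relies on.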

Related

How do I convert a 7-byte number saved in HEX format to a decimal number if the largest supported data type is 32-bits long?

I am trying to output a 7-byte Unique Identifier (UID) from a Mifare Classic card to a screen in decimal format.
I have already read and split the UID into a HEX array containing the high and low nibbles. Therefore I have an array with 14 elements. If I had a 64-bit data type available, I would normally loop through the elements in the array and multiply them by the corresponding value to get the UID in decimal format. I would then separate each digit of the UID into an array to be output to the screen.
The issue is that the largest supported data type available is 32-bits on the Keil C51 compiler for 8051 processor.
Can somebody point me in the right direction please?
Pseudo-code example with 2-byte UID:
UID = 0x1234, UID_array = [0x01 0x02 0x03 0x04], Decimal_value = 0, Decimal_digit[5] = 0
Decimal_value += 0x1000 * 0x01
Decimal_value += 0x100 * 0x02
Decimal_value += 0x10 * 0x03
Decimal_value += 1 * 0x04
Decimal_value = 4660
Decimal_digit[0] = (uint8)((Decimal_value/10000) % 10)
Decimal_digit[1] = (uint8)((Decimal_value/1000) % 10)
Decimal_digit[2] = (uint8)((Decimal_value/100) % 10)
Decimal_digit[3] = (uint8)((Decimal_value/10) % 10)
Decimal_digit[4] = (uint8)(Decimal_value % 10)
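One possible direction, sketched in portable C under some assumptions: instead of building one wide integer, keep the UID as a byte array and repeatedly divide the whole array by 10, collecting the remainders as decimal digits. Only 16-bit intermediates are needed. The function name `bytes_to_decimal`, the big-endian byte layout, and the buffer sizes are assumptions for illustration; this has not been tried on the Keil C51 compiler.

```c
#include <stdint.h>
#include <stddef.h>

/* Convert a big-endian byte array to a decimal string using schoolbook
   long division by 10. 'bytes' is overwritten (it ends up all zero).
   Assumes len <= 12 so that the digits fit in tmp[]; 'out' must have
   room for the digits plus the terminating '\0'.
   Returns the number of digits written. */
size_t bytes_to_decimal(uint8_t *bytes, size_t len, char *out)
{
    char tmp[32];          /* digits, least significant first */
    size_t n = 0, i;
    int nonzero;

    do {
        uint16_t rem = 0;
        nonzero = 0;
        /* Divide the whole array by 10, most significant byte first */
        for (i = 0; i < len; i++) {
            uint16_t cur = (uint16_t)((rem << 8) | bytes[i]);
            bytes[i] = (uint8_t)(cur / 10);
            rem = (uint16_t)(cur % 10);
            if (bytes[i]) nonzero = 1;
        }
        tmp[n++] = (char)('0' + rem);   /* remainder is the next digit */
    } while (nonzero);

    /* Reverse into the output so the most significant digit comes first */
    for (i = 0; i < n; i++) out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return n;
}
```

A 7-byte UID needs at most 17 decimal digits, so an 18-character output buffer is enough. If the UID is held as 14 nibbles, as in the question, each pair would first be recombined into a byte as `(high << 4) | low`.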

How to bruteforce a lossy AND routine?

I'm wondering whether there are any standard approaches to reversing AND routines by brute force.
For example I have the following transformation:
MOV(eax, 0x5b3e0be0) # Here we move 0x5b3e0be0 to EAX.
MOV(edx, eax) # Here we copy 0x5b3e0be0 to EDX as well.
SHL(edx, 0x7) # Shift 0x5b3e0be0 left by 0x7, which results in 0x9f05f000
AND(edx, 0x9d2c5680) # AND 0x9f05f000 with 0x9d2c5680, which results in 0x9d045000
XOR(edx, eax) # XOR 0x9d045000 with the original value 0x5b3e0be0, which results in 0xc63a5be0
My question is how to brute force and reverse this routine (i.e. transform 0xc63a5be0 back into 0x5b3e0be0).
One idea I had (which didn't work) was this, using a PeachPy implementation:
#Input values
MOV(esi, 0xffffffff) # Initial value to AND with, which will be decreased by 1 in a loop.
MOV(cl, 0x1) # Initial value to SHR with, which will be increased by 1 until 0x1f.
MOV(eax, 0xc63a5be0) # Target result which I'm looking to get using the below loop.
MOV(edx, 0x5b3e0be0) # Input value which will be transformed.
sub_esi = peachpy.x86_64.Label()
with loop:
#End the loop if ESI = 0x0
TEST(esi, esi)
JZ(loop.end)
#Test the routine and check if it matches end result.
MOV(ebx, eax)
SHR(ebx, cl)
TEST(ebx, ebx)
JZ(sub_esi)
AND(ebx, esi)
XOR(ebx, eax)
CMP(ebx, edx)
JZ(loop.end)
#Add to the CL register which is used for SHR.
#Also check if we've reached the last potential value of CL which is 0x1f
ADD(cl, 0x1)
CMP(cl, 0x1f)
JNZ(loop.begin)
#Decrement ESI by 1, reset CL and restart routine.
peachpy.x86_64.LABEL(sub_esi)
SUB(esi, 0x1)
MOV(cl, 0x1)
JMP(loop.begin)
#The ESI result here will either be 0x0 or a valid value to AND with and get the necessary result.
RETURN(esi)
Maybe an article or a book you can recommend specific to this?
It's not lossy, the final operation is an XOR.
The whole routine can be modeled in C as
#define K 0x9d2c5680
uint32_t hash(uint32_t num)
{
return num ^ ( (num << 7) & K);
}
Now, if we have two bits x and y and the operation x XOR y, then when y is zero the result is x.
So given two numbers n1 and n2 and their XOR, the bits of n1 that pair with a zero in n2 make it to the result unchanged (the others are flipped).
So in num ^ ( (num << 7) & K) we can identify num with n1 and (num << 7) & K with n2.
Since n2 is the result of an AND, it must have zero bits at least wherever K has them.
This means that each bit of num that corresponds to a zero bit in the constant K makes it into the result unchanged.
Thus, by extracting those bits from the result we already have a partial inverse function:
/*hash & ~K extracts the bits of hash that pair with a zero bit in K*/
partial_num = hash & ~K
Technically, the factor num << 7 would also introduce other zeros in the result of the AND. We know for sure that the lowest 7 bits must be zero.
However K already has the lowest 7 bits zero, so we cannot exploit this information.
So we will just use K here, but if its value were different you'd need to consider the AND (which, in practice, means to zero the lower 7 bits of K).
This leaves us with 13 bits unknown (the ones corresponding to the bits that are set in K).
If we forget about the AND for a moment, we would have x ^ (x << 7), meaning that
h_i = num_i for i from 0 to 6 inclusive
h_i = num_i ^ num_(i-7) for i from 7 to 31 inclusive
(The first line holds because the lower 7 bits of the right-hand operand, x << 7, are zero.)
From this, starting from h_7 and going up, we can retrieve num_7 as h_7 ^ num_0 = h_7 ^ h_0.
From bit 14 onward, the equality num_(i-7) = h_(i-7) no longer holds and we need to use num_(i-7) itself, but luckily we have already computed its value in a previous step (that's why we go from lower to higher bits).
What the AND does to this is just restrict the values the index i runs over, specifically to the bits that are set in K.
So to fill in the thirteen remaining bits one has to do:
part_num_7 = h_7 ^ part_num_0
part_num_9 = h_9 ^ part_num_2
part_num_12 = h_12 ^ part_num_5
...
part_num_31 = h_31 ^ part_num_24
Note that we exploited the fact that part_num_0..6 = h_0..6.
Here's a C program that inverts the function:
#include <stdio.h>
#include <stdint.h>
/* Recover input bit i+7 as h[i+7] ^ result[i] and move it to position i+7.
   The cast keeps the shift in unsigned 32-bit arithmetic (i+7 can be 31). */
#define BIT(i, hash, result) ( (uint32_t)((((result) >> (i)) ^ ((hash) >> ((i)+7))) & 0x1u) << ((i)+7) )
#define K 0x9d2c5680u
uint32_t base_candidate(uint32_t hash)
{
    /* Bits where K is zero pass through the hash unchanged */
    uint32_t result = hash & ~K;
    /* Fill in the 13 bits where K is set, lowest first, because
       later steps read bits produced by earlier ones */
    result |= BIT(0, hash, result);
    result |= BIT(2, hash, result);
    result |= BIT(3, hash, result);
    result |= BIT(5, hash, result);
    result |= BIT(7, hash, result);
    result |= BIT(11, hash, result);
    result |= BIT(12, hash, result);
    result |= BIT(14, hash, result);
    result |= BIT(17, hash, result);
    result |= BIT(19, hash, result);
    result |= BIT(20, hash, result);
    result |= BIT(21, hash, result);
    result |= BIT(24, hash, result);
    return result;
}
uint32_t hash(uint32_t num)
{
    return num ^ ( (num << 7) & K);
}
int main()
{
    uint32_t tester = 0x5b3e0be0;
    uint32_t candidate = base_candidate(hash(tester));
    printf("candidate: %x, tester %x\n", candidate, tester);
    return 0;
}
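The same idea can also be written as a loop instead of one hand-picked line per set bit of K. This is only a sketch of the bit-by-bit recovery described above; the names `hash_fwd`, `hash_inv`, `SHIFT` and `MASK` are made up for the example.

```c
#include <stdint.h>

#define SHIFT 7
#define MASK  0x9d2c5680u   /* the constant K from the answer above */

/* Forward transform: num ^ ((num << 7) & K) */
uint32_t hash_fwd(uint32_t num)
{
    return num ^ ((num << SHIFT) & MASK);
}

/* Inverse: rebuild num from the lowest bit upward.
   For every bit i >= 7: num_i = h_i ^ (num_(i-7) & K_i). */
uint32_t hash_inv(uint32_t h)
{
    /* The low 7 bits are untouched because (num << 7) is zero there */
    uint32_t num = h & ((1u << SHIFT) - 1u);
    int i;

    for (i = SHIFT; i < 32; i++) {
        uint32_t lower = (num >> (i - SHIFT)) & 1u;  /* already recovered */
        uint32_t k     = (MASK >> i) & 1u;           /* bit i of K */
        uint32_t bit   = ((h >> i) & 1u) ^ (lower & k);
        num |= bit << i;
    }
    return num;
}
```

Because every input bit is recovered from bits that are already known, each output has exactly one preimage, and the loop works unchanged for other mask values.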
Since the original question was how to "bruteforce" rather than solve, here's something that I eventually came up with which works just as well. In general a search like this could match more than one input; here the transformation is invertible, so it finds exactly one.
from peachpy import *
from peachpy.x86_64 import *
input = 0xc63a5be0
x = Argument(uint32_t)
with Function("DotProduct", (x,), uint32_t) as asm_function:
    LOAD.ARGUMENT(edx, x) # EDX = target value to match (0xc63a5be0 here)
    MOV(esi, 0xffffffff)
    with Loop() as loop:
        TEST(esi,esi)
        JZ(loop.end)
        MOV(eax, esi)
        SHL(eax, 0x7)
        AND(eax, 0x9d2c5680)
        XOR(eax, esi)
        CMP(eax, edx)
        JZ(loop.end)
        SUB(esi, 0x1)
        JMP(loop.begin)
    RETURN(esi)
#Read Assembler Return
abi = peachpy.x86_64.abi.detect()
encoded_function = asm_function.finalize(abi).encode()
python_function = encoded_function.load()
print(hex(python_function(input)))

ATTINY84: Weird problem with reversed byte order

I am encoding 6 values (4x 3bit + 1bit) into a 16bit integer and transfer them via serial to an ATTINY84 splitting them into 2 bytes. That works all good until the point that I re-assemble the bytes into a 16bit int.
Example:
I am sending the following binary state 0001110000001100, which translates to 7180 and gets split into a byte array of [12, 28].
I am putting that byte array into the EEPROM and read it on the next power cycle.
After power cycle my serial debug output looks like this:
12
28
7180
Awesome. Looks all good and my code for that part is:
byte d0 = EEPROM.read(0);
byte d1 = EEPROM.read(1);
unsigned int w = d0 + (256 * d1);
But now the weirdest thing happens. When I do a bit-by-bit read I am getting back:
0011000000111000
should be:
0001110000001100
via:
for(byte t = 0; t < 16; t++) {
serial.print(bitRead(w, t) ? "1" : "0");
}
The bit representation is completely reversed. How is that possible? Or maybe I am missing something.
Also I confirmed when I extract the actual 3 bit location to receive my original value 0..7 it's all off.
Any help would be appreciated.
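So it looks like I fell into the little/big endian trap, or more precisely a bit-order one.
Basically, as Alain said in the comments, everything is correct and only the representations differ: bitRead(w, 0) returns the least significant bit, so my loop prints the bits LSB first, the mirror image of the usual MSB-first notation.
I came up with the following method that can extract bits from a number stored "little endian" (bit 0 = LSB) using "big endian" (MSB-first) offsets: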
/**
* #bex
*/
uint8_t bexd(uint16_t n, uint8_t o, uint8_t l, uint8_t d) {
    uint8_t v = 0;
    uint8_t ob = d - o;
    for (uint8_t b = ob; b > (ob - l); b--) {
        v = (v << 1) | (0x0001 & (n >> (b - 1)));
    }
    return v;
}
uint8_t bexw(uint16_t n, uint8_t o, uint8_t l) { return bexd(n, o, l, 16); }
uint8_t bexb(uint8_t n, uint8_t o, uint8_t l) { return bexd(n, o, l, 8); }
For example:
In big endian the "second" value is stored in bits 3, 4, and 5, compared to little endian where it is stored in bits 10, 11, and 12. The method above lets you work with a "little endian" value as if it were a "big endian" value.
To extract the second value from 7180 (0001110000001100), just do:
byte v = bexw(7180, 3, 3); // 111
Serial.println(v); // prints 7
Hope that helps someone.
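For comparison, here is a minimal sketch in plain C of the two operations involved: printing a 16-bit word MSB first, and extracting a field whose offset is counted from the MSB end using a single shift and mask. The function names are hypothetical, and the field length is assumed to be at most 8 bits.

```c
#include <stdint.h>

/* Write the 16 bits of w into out (17 chars including '\0'),
   most significant bit first. */
void bits_msb_first(uint16_t w, char out[17])
{
    int i;
    for (i = 0; i < 16; i++)
        out[i] = ((w >> (15 - i)) & 1u) ? '1' : '0';
    out[16] = '\0';
}

/* Extract 'len' bits (assumed 1..8) that start 'offset' bits
   from the MSB end of a 16-bit word. */
uint8_t field_from_msb(uint16_t w, uint8_t offset, uint8_t len)
{
    uint8_t shift = (uint8_t)(16 - offset - len);
    return (uint8_t)((w >> shift) & ((1u << len) - 1u));
}
```

With the value from the question, `field_from_msb(7180, 3, 3)` gives 7, the same bits that the loop-based method above picks out.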

Are there any languages for querying CBOR?

I'm looking for a language for querying CBOR, like JsonPath or jq but for the CBOR binary format. I don't want to convert from CBOR to JSON, because some CBOR types do not exist in JSON and because of the performance cost.
The C++ library jsoncons allows you to query CBOR with JSONPath, for example,
#include <jsoncons/json.hpp>
#include <jsoncons_ext/cbor/cbor.hpp>
#include <jsoncons_ext/jsonpath/json_query.hpp>
#include <iomanip>
using namespace jsoncons; // For convenience
int main()
{
std::vector<uint8_t> v = {0x85,0xfa,0x40,0x0,0x0,0x0,0xfb,0x3f,0x12,0x9c,0xba,0xb6,0x49,0xd3,0x89,0xc3,0x49,0x1,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0xc4,0x82,0x38,0x1c,0xc2,0x4d,0x1,0x8e,0xe9,0xf,0xf6,0xc3,0x73,0xe0,0xee,0x4e,0x3f,0xa,0xd2,0xc5,0x82,0x20,0x3};
/*
85 -- Array of length 5
fa -- float
40000000 -- 2.0
fb -- double
3f129cbab649d389 -- 0.000071
c3 -- Tag 3 (negative bignum)
49 -- Byte string value of length 9
010000000000000000
c4 -- Tag 4 (decimal fraction)
82 -- Array of length 2
38 -- Negative integer of length 1
1c -- -29
c2 -- Tag 2 (positive bignum)
4d -- Byte string value of length 13
018ee90ff6c373e0ee4e3f0ad2
c5 -- Tag 5 (bigfloat)
82 -- Array of length 2
20 -- -1
03 -- 3
*/
// Decode to a json value (despite its name, it is not JSON specific.)
json j = cbor::decode_cbor<json>(v);
// Serialize to JSON
std::cout << "(1)\n";
std::cout << pretty_print(j);
std::cout << "\n\n";
// as<std::string>() and as<double>()
std::cout << "(2)\n";
std::cout << std::dec << std::setprecision(15);
for (const auto& item : j.array_range())
{
std::cout << item.as<std::string>() << ", " << item.as<double>() << "\n";
}
std::cout << "\n";
// Query with JSONPath
std::cout << "(3)\n";
json result = jsonpath::json_query(j,"$.[?(@ < 1.5)]");
std::cout << pretty_print(result) << "\n\n";
// Encode result as CBOR
std::vector<uint8_t> val;
cbor::encode_cbor(result,val);
std::cout << "(4)\n";
for (auto c : val)
{
std::cout << std::hex << std::setprecision(2) << std::setw(2)
<< std::setfill('0') << static_cast<int>(c);
}
std::cout << "\n\n";
/*
83 -- Array of length 3
fb -- double
3f129cbab649d389 -- 0.000071
c3 -- Tag 3 (negative bignum)
49 -- Byte string value of length 9
010000000000000000
c4 -- Tag 4 (decimal fraction)
82 -- Array of length 2
38 -- Negative integer of length 1
1c -- -29
c2 -- Tag 2 (positive bignum)
4d -- Byte string value of length 13
018ee90ff6c373e0ee4e3f0ad2
*/
}
Output:
(1)
[
2.0,
7.1e-05,
"-18446744073709551617",
"1.23456789012345678901234567890",
[-1, 3]
]
(2)
2.0, 2
7.1e-05, 7.1e-05
-18446744073709551617, -1.84467440737096e+19
1.23456789012345678901234567890, 1.23456789012346
1.5, 1.5
(3)
[
7.1e-05,
"-18446744073709551617",
"1.23456789012345678901234567890"
]
(4)
83fb3f129cbab649d389c349010000000000000000c482381cc24d018ee90ff6c373e0ee4e3f0ad2
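To read the annotated hex dumps in the example above, it helps to know that every CBOR data item starts with an initial byte whose top 3 bits are the major type and whose low 5 bits are the "additional information" (a small value, a length, or a marker for how many bytes follow). The two helpers below are a small sketch of that split; their names are made up for illustration and are not part of jsoncons.

```c
#include <stdint.h>

/* Major type: top 3 bits of the initial byte.
   0 = unsigned int, 1 = negative int, 2 = byte string, 3 = text string,
   4 = array, 5 = map, 6 = tag, 7 = simple value / float. */
uint8_t cbor_major_type(uint8_t initial_byte)
{
    return (uint8_t)(initial_byte >> 5);
}

/* Additional information: low 5 bits of the initial byte. */
uint8_t cbor_additional_info(uint8_t initial_byte)
{
    return (uint8_t)(initial_byte & 0x1f);
}
```

For instance, 0x85 splits into major type 4 and additional information 5, which is the "Array of length 5" in the first comment, and 0xc3 splits into major type 6 and value 3, which is "Tag 3".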
Sure, you can use any general purpose programming language for querying CBOR, for example JavaScript might be a good choice. But if you are looking for a "query language" like JsonPath, I'm not aware of any specifically developed for CBOR.

How to make ARGB transparency using bitwise operators

I need to make transparency, having 2 pixels:
pixel1: {A, R, G, B} - foreground pixel
pixel2: {A, R, G, B} - background pixel
A,R,G,B are Byte values
each color is represented by byte value
now I'm calculating transparency as:
newR = pixel2_R * alpha / 255 + pixel1_R * (255 - alpha) / 255
newG = pixel2_G * alpha / 255 + pixel1_G * (255 - alpha) / 255
newB = pixel2_B * alpha / 255 + pixel1_B * (255 - alpha) / 255
but it is too slow
I need to do it with bitwise operators (AND,OR,XOR, NEGATION, BIT MOVE)
I want to do it on Windows Phone 7 XNA
---attached C# code---
public static uint GetPixelForOpacity(uint reduceOpacityLevel, uint pixelBackground, uint pixelForeground, uint pixelCanvasAlpha)
{
byte surfaceR = (byte)((pixelForeground & 0x00FF0000) >> 16);
byte surfaceG = (byte)((pixelForeground & 0x0000FF00) >> 8);
byte surfaceB = (byte)((pixelForeground & 0x000000FF));
byte sourceR = (byte)((pixelBackground & 0x00FF0000) >> 16);
byte sourceG = (byte)((pixelBackground & 0x0000FF00) >> 8);
byte sourceB = (byte)((pixelBackground & 0x000000FF));
uint newR = sourceR * pixelCanvasAlpha / 256 + surfaceR * (255 - pixelCanvasAlpha) / 256;
uint newG = sourceG * pixelCanvasAlpha / 256 + surfaceG * (255 - pixelCanvasAlpha) / 256;
uint newB = sourceB * pixelCanvasAlpha / 256 + surfaceB * (255 - pixelCanvasAlpha) / 256;
return (uint)255 << 24 | newR << 16 | newG << 8 | newB;
}
You can't do an 8 bit alpha blend using only bitwise operations, unless you basically re-invent multiplication with basic ops (8 shift-adds).
You can do two methods as mentioned in other answers: use 256 instead of 255, or use a lookup table. Both have issues, but you can mitigate them. It really depends on what architecture you're doing this on: the relative speed of multiply, divide, shift, add and memory loads. In any case:
Lookup table: a trivial 256x256 lookup table is 64KB. This will thrash your data cache and end up being very slow. I wouldn't recommend it unless your CPU has an abysmally slow multiplier, but does have low latency RAM. You can improve performance by throwing away some alpha bits, e.g A>>3, resulting in 32x256=8KB of lookup, which has a better chance of fitting in cache.
Use 256 instead of 255: the idea being divide by 256 is just a shift right by 8. This will be slightly off and tend to round down, darkening the image slightly, e.g if R=255, A=255 then (R*A)/256 = 254. You can cheat a little and do this: (R*A+R+A)/256 or just (R*A+R)/256 or (R*A+A)/256 = 255. Or, scale A to 0..256 first, e.g: A = (256*A)/255. That's just one expensive divide-by-255 instead of 6. Then, (R*A)/256 = 255.
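A minimal sketch of the "scale A to 0..256" variant, written in C for brevity: alpha is rescaled once per pixel, and each channel then needs only multiplies, adds and a shift. The packed version blends red and blue together in one multiply by masking them into separate 16-bit lanes. All function names are hypothetical, the output alpha is forced to 255 as in the question's code, and here `alpha` weights the `top` pixel.

```c
#include <stdint.h>

/* Rescale alpha from 0..255 to 0..256 (one divide per pixel). */
uint32_t alpha_to_256(uint8_t alpha)
{
    return (256u * alpha) / 255u;
}

/* Blend one 8-bit channel; a256 must be in 0..256. */
uint8_t blend_channel(uint8_t top, uint8_t bottom, uint32_t a256)
{
    return (uint8_t)((top * a256 + bottom * (256u - a256)) >> 8);
}

/* Blend two packed ARGB pixels. Red and blue sit in separate 16-bit
   lanes of the masked value, so one multiply handles both without
   the products overlapping; green is handled on its own. */
uint32_t blend_argb(uint32_t top, uint32_t bottom, uint8_t alpha)
{
    uint32_t a  = alpha_to_256(alpha);
    uint32_t ia = 256u - a;
    uint32_t rb = (((top & 0x00FF00FFu) * a +
                    (bottom & 0x00FF00FFu) * ia) >> 8) & 0x00FF00FFu;
    uint32_t g  = (((top & 0x0000FF00u) * a +
                    (bottom & 0x0000FF00u) * ia) >> 8) & 0x0000FF00u;
    return 0xFF000000u | rb | g;
}
```

Whether this beats the original per-channel code depends on the target CPU, so it is worth measuring before adopting it.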
I don't think it can be done with the same precision using only those operators. Your best bet is, I reckon, using a LUT (as long as the LUT can fit in the CPU cache, otherwise it might even be slower)
// allocate the LUT (64KB)
// __cacheline_aligned is not a built-in keyword (the Linux kernel defines it);
// with plain GCC use __attribute__((aligned(64))), assuming 64-byte cache lines
unsigned char lut[256*256] __cacheline_aligned;
// macro to access the LUT
#define LUT(pixel, alpha) (lut[(alpha)*256+(pixel)])
// precompute the LUT
for (int alpha_value=0; alpha_value<256; alpha_value++) {
    for (int pixel_value=0; pixel_value<256; pixel_value++) {
        LUT(pixel_value, alpha_value) = (unsigned char)((double)(pixel_value) * (double)(alpha_value) / 255.0);
    }
}
// in the loop
unsigned char ialpha = 255-alpha;
newR = LUT(pixel2_R, alpha) + LUT(pixel1_R, ialpha);
newG = LUT(pixel2_G, alpha) + LUT(pixel1_G, ialpha);
newB = LUT(pixel2_B, alpha) + LUT(pixel1_B, ialpha);
Otherwise, you should try vectorizing your code. To help with that, we would need more information about your CPU architecture and compiler. Keep in mind that your compiler might be able to vectorize automatically, if given the right options.
