Intel Pin: getting instruction memory read/write size

I am trying to modify the Memory Reference Trace (Instruction Instrumentation) example from the Pin documentation.
My goal is to extract, for each instruction that accesses memory, the size in bytes of the memory being read or written.
I looked in the documentation and found that I need to use
IARG_MEMORYREAD_SIZE
IARG_MEMORYWRITE_SIZE
to get that size.
However, I couldn't find in the documentation how to extract this data from the instruction.
Here is my code:
for (UINT32 memOp = 0; memOp < memOperands; memOp++)
{
if (INS_MemoryOperandIsRead(ins, memOp))
{
if(INS_hasKnownMemorySize(ins))
{
//IARG_MEMORYREAD_SIZE memReadSize = what to do here?
INS_InsertPredicatedCall(
ins, IPOINT_BEFORE, (AFUNPTR)MyFuncWhenRead,
IARG_INST_PTR,
IARG_MEMORYOP_EA, memOp,
IARG_END);
}
}
if (INS_MemoryOperandIsWritten(ins, memOp))
{
if(INS_hasKnownMemorySize(ins))
{
//IARG_MEMORYWRITE_SIZE memWriteSize = what to do here?
INS_InsertPredicatedCall(
ins, IPOINT_BEFORE, (AFUNPTR)MyFuncWhenWrite,
IARG_INST_PTR,
IARG_MEMORYOP_EA, memOp,
IARG_END);
}
}
}
Would appreciate some help solving this.
That is, what to write in the line with the comment
//IARG_MEMORYREAD_SIZE memReadSize = ???
Thanks!

As a quick reminder (this is an important concept in PIN, often overlooked):
Conceptually, instrumentation consists of two components:
A mechanism that decides where and what code is inserted: the instrumentation.
The code to execute at insertion points: the analysis.
The INS_Insert(xxx)Call functions are used in the instrumentation routine to specify when the analysis routine is called and what arguments it receives. So, in your code:
INS_InsertPredicatedCall(
ins, IPOINT_BEFORE, (AFUNPTR)MyFuncWhenRead,
IARG_INST_PTR,
IARG_MEMORYOP_EA, memOp,
IARG_END);
IPOINT_BEFORE is the when.
It tells Pin where the analysis call is inserted relative to the instrumented instruction (here, before the instruction).
IARG_INST_PTR, IARG_MEMORYOP_EA are the what.
They determine the arguments that are passed to the analysis routine.
They are received by the analysis routine in the order they are declared.
MyFuncWhenRead is the analysis routine called by the instrumentation.
Anything that starts with IARG_ is an IARG_TYPE, which has to be passed to the INS_Insert(xxx)Call function.
The documentation for IARG_MEMORYREAD_SIZE says:
IARG_MEMORYREAD_SIZE Type: UINT32. Size in bytes of memory read. (...)
The Type tells us what the analysis routine receives.
In your case you have (in this precise order):
IARG_INST_PTR: Type: ADDRINT
IARG_MEMORYOP_EA: Type: ADDRINT
IARG_MEMORYREAD_SIZE: Type: UINT32
This means the call in your instrumentation routine will look like this:
INS_InsertPredicatedCall(
ins, IPOINT_BEFORE, (AFUNPTR)MyFuncWhenRead,
IARG_INST_PTR,
IARG_MEMORYOP_EA, memOp,
IARG_MEMORYREAD_SIZE,
IARG_END);
And your analysis function should look like this:
VOID MyFuncWhenRead(
ADDRINT ins_ptr, // from IARG_INST_PTR (address of the instruction)
ADDRINT mem_op_addr, // from IARG_MEMORYOP_EA (address of the memory read)
UINT32 mem_read_size // from IARG_MEMORYREAD_SIZE (size of the read)
)
{
// ...
}
The same logic applies to IARG_MEMORYWRITE_SIZE.

Related

Can I have boolean buffer in OpenCL and change its value during kernel execution, example to break while loop

I want to run some experiments in OpenCL, and I would like to know whether it is possible to change state during kernel execution from host code using a buffer.
I attempted to end a while loop in the kernel by modifying the buffer value from the host code, but execution hangs.
void my_kernel(
__global bool *in,
__global int *out)
{
int i = get_global_id(0);
while(1) {
if(1 == *in) {
printf("while loop is finished");
break;
}
}
printf("out[0] = %d\n", out[0]);
}
I then call clEnqueueWriteBuffer() a second time to change the input value:
input[0] = 1;
err = clEnqueueWriteBuffer(commands, input_buffer,
CL_TRUE, 0, sizeof(int), (void*)input,
0, NULL,NULL);
At least for OpenCL 1.x, this is not permitted, and any behaviour you may observe in one implementation cannot be relied upon.
See the NOTE in the OpenCL 1.2 specification, section 5.2.2, Reading, Writing and Copying Buffer Objects:
Calling clEnqueueWriteBuffer to update the latest bits in a region of the buffer object with the ptr argument value set to host_ptr + offset, where host_ptr is a pointer to the memory region specified when the buffer object being written is created with CL_MEM_USE_HOST_PTR, must meet the following requirements in order to avoid undefined behavior:
The host memory region given by (host_ptr + offset, cb) contains the latest bits when the enqueued write command begins execution.
The buffer object or memory objects created from this buffer object are not mapped.
The buffer object or memory objects created from this buffer object are not used by any command-queue until the write command has finished execution.
The final condition is not met by your code, therefore its behaviour is undefined.
I am not certain whether the situation is different with OpenCL 2.x's Shared Virtual Memory (SVM) feature, as I have no practical experience with it; perhaps someone else can contribute an answer for that.

Using OpenMP with GPU

Hello everyone!
I would like to ask the community for advice on using GPU computing power instead of, or together with, the CPU.
I have a working program based on a recursive search over all combinations of certain events, parallelized with OpenMP to run on all available processor cores.
The C++ pseudocode is as follows:
// #includes
// function declarations
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")
int main() {
// reads data from file
// data are converted and analyzed
// the variant variable containing the current best result is filled in (here - by pre-analysis)
#pragma omp parallel shared(variant)
#pragma omp master
// call the recursive algorithm that searches all variants:
PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
// determine the bounds of the first loop
for (int i = quantity; i < another_quantity; i++) {
// Tabl_1 is processed and modified to determine the number of iterations of the inner for loop
for (int k = 0; k < the_quantity_just_found; k++) {
if the recursion depth is not 1, we go down further: {
// add the descent to the next recursion level as a task:
#pragma omp task
PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth-1);
}
else (if we went down to the lowest level): {
if (condition fulfilled) // condition check - READ variant variable
variant = it_is_equal_to_that_,_to_that...;
else
continue;
}
}
}
}
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without one the algorithm runs for a very long time. At work, I was advised to consider using a GPU to speed up the calculations. I learned that OpenMP can offload to video cards (especially NVIDIA ones), and that OpenACC also does this well.
So my main question is: is it possible to run a recursive algorithm on a GPU simply and, at the same time, efficiently? Can this give a noticeable speedup relative to the CPU? If so, would OpenACC do better? Is it possible to give instructions to the video card through "#pragma omp task", or are other directives REQUIRED? And how could calculations on the CPU and GPU be combined?
Thank you so much for any help!
P.S. I apologize for my English, which is not my native language :)

How does one call a function from its memory address in AVR C?

I am writing a function:
void callFunctionAt(uint32_t address){
//There is a void at address, how do I run it?
}
This is in Atmel Studio's C++. If previous questions are to be believed, the simple answer is to write the line "address();". This cannot be correct. Without changing the header of this function, how would one call the function located at the address given?
The answer should be system-agnostic for all micro controllers which support standard C++ compilation.
The common way to do this is to give the argument the correct type. Then you can call it right away:
void callFunctionAt(void (*address)()) {
address();
}
However, since you wrote "Without changing the header of this function [...]", you need to cast the unsigned integer to a function pointer:
void callFunctionAt(uint32_t address) {
void (*f)() = reinterpret_cast<void (*)()>(address);
f();
}
But this is neither safe nor portable, because it assumes that a uint32_t can be cast to a function pointer. That need not be true if the answer is to be "[...] system-agnostic for all micro controllers [...]": function pointers can have widths other than 32 bits. Pointers in general might consist of more than the pure address, for example a selector for memory spaces, depending on the system's architecture.
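To make the width concern concrete, here is a minimal sketch that takes the address as std::uintptr_t instead of uint32_t and checks at compile time that a function pointer fits into it. The names blink, g_calls and func_t are hypothetical and exist only for this illustration. The integer-to-function-pointer conversion is still implementation-defined, so treat this as a sketch for common flat-address platforms, not as a portable guarantee.

```cpp
#include <cstdint>

// Hypothetical target function and call counter, used only for this illustration.
static int g_calls = 0;
void blink() { ++g_calls; }

using func_t = void (*)();

// Guard against platforms where a function pointer is wider than uintptr_t.
static_assert(sizeof(func_t) <= sizeof(std::uintptr_t),
              "function pointers do not fit into uintptr_t on this platform");

// Takes the address as an integer that is wide enough for a pointer here.
void callFunctionAt(std::uintptr_t address) {
    // Implementation-defined conversion from integer to function pointer.
    func_t f = reinterpret_cast<func_t>(address);
    f();
}
```

A caller would obtain the integer with reinterpret_cast<std::uintptr_t>(&blink) and pass it to callFunctionAt. Passing the function pointer itself remains the cleaner option whenever the signature may be changed.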
If you got the address from a linker script, you might have declared it like this:
extern const uint32_t ext_func;
and want to use it like this:
callFunctionAt(ext_func);
But you can change the declaration to:
extern void ext_func();
And call it directly or indirectly:
ext_func();
callFunctionAt(&ext_func);
The definition in the linker can stay as it is, because the linker knows nothing about types.
There is no generic way. It depends on which compiler you are using. In the following I'll assume avr-g++ because it's common and freely available.
Spoiler: On AVR, it's more complicated than on most other machines.
Suppose you actually have a uint32_t address which is a byte address. Function pointers in avr-g++ are actually word addresses, where a word has 16 bits. Hence, you'll have to divide the byte address by 2 first to get a word address; then cast it to a function pointer and call it:
#include <stdint.h>
typedef void (*func_t)(void);
void callFunctionAt (uint32_t byte_address)
{
func_t func = (func_t) (byte_address >> 1);
func();
}
If you started with a word address, then you can call it without further ado:
void callFunctionAt (uint32_t word_address)
{
((func_t) word_address)();
}
This will only work for devices with up to 128KiB of flash memory!
The reason is that addresses in avr-g++ are 16 bits long, cf. the layout of void* as per avr-gcc ABI. This means using scalar addresses on devices with flash > 128KiB will not work in general, for example when you issue callFunctionAt (0x30000) on an ATmega2560.
On such devices, the 16-bit address in Z register used by EICALL instruction is extended by the value held in the EIND special function register, and you must not change EIND after entering main. The avr-g++ documentation is clear about that.
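The arithmetic behind the 128 KiB limit can be sketched in a few lines of plain C++. The helper names toWordAddress and fitsIn16BitPointer are made up for this illustration; the only facts taken from the text above are that a flash word is 2 bytes and that a function pointer holds a 16-bit word address.

```cpp
#include <cstdint>

// Converts a flash byte address to a word address (one flash word = 2 bytes).
constexpr std::uint32_t toWordAddress(std::uint32_t byte_address) {
    return byte_address >> 1;
}

// True if the resulting word address can be held in a 16-bit function pointer.
constexpr bool fitsIn16BitPointer(std::uint32_t byte_address) {
    return toWordAddress(byte_address) <= 0xFFFFu;
}
```

The highest byte address below 128 KiB, 0x1FFFF, maps to word address 0xFFFF and still fits. The byte address 0x30000 from the ATmega2560 example maps to word address 0x18000, which does not fit into 16 bits.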
The crucial point here is how you are getting the address. First, in order to call and pass it around properly, use a function pointer:
typedef void (*func_t)(void);
void callFunctionAt (func_t address)
{
address();
}
void func (void);
void call_func()
{
func_t addr = func;
callFunctionAt (addr);
}
I am using void argument in the declaration because this is how you'd do it in C.
Or, if you don't like the typedef:
void callFunctionAt (void (*address)(void))
{
address();
}
void func (void);
void call_func ()
{
void (*addr)(void) = func;
callFunctionAt (addr);
}
If you want to call a function at a specific word address, for example 0x0 to "reset" the µC (see footnote 1), you could write:
void call_0x0()
{
callFunctionAt ((func_t) 0x0);
}
but whether this works depends on where your vector table is located, or more specifically, how EIND was initialized by the startup code. What will always work is using a symbol and defining it with -Wl,--defsym,func=0 when linking, together with the following code:
extern "C" void func();
void call_func ()
{
void (*addr)(void) = func;
callFunctionAt (addr);
}
The big difference compared to using 0x0 directly is that the compiler will wrap the symbol func with the symbol modifier gs, which it will not do when using 0x0 directly:
_Z9call_funcv:
ldi r24,lo8(gs(func))
ldi r25,hi8(gs(func))
jmp _Z14callFunctionAtPFvvE
This is needed to advise the linker to generate a stub if the address is out of the reach of EIJMP.
Footnote 1: This will not reset the hardware. The best approach to force a reset is to let the watchdog timer (WDT) issue a reset for you.
Methods
Yet another situation is when you want the address of a non-static method of a class, because in that case you also need a this pointer:
class A
{
int a = 1;
public:
int method1 () { return a += 1; }
int method2 () { return a += 2; }
};
void callFunctionAt (A *b, int (A::*f)())
{
A a;
(a.*f)();
(b->*f)();
}
void call_method ()
{
A a;
callFunctionAt (&a, &A::method1);
callFunctionAt (&a, &A::method2);
}
The 2nd argument of callFunctionAt specifies which method (of a given prototype) you want, but you also need an object (or pointer to one) to apply it. avr-g++ will use gs when taking the method's address (provided the following call(s) cannot be inlined), thus it will also work for all AVR devices.
Based on the comments, I think you are asking how a microcontroller calls a function.
Could you compile your program so that you can see the assembly files?
I would recommend reading one of them.
After compilation, every function is translated into instructions that the CPU can execute (loading a register, adding to a register, etc.).
So your void foo(int x) {statements;} compiles to simple CPU instructions, and whenever you call foo(x) in your program, execution moves to the instructions that belong to foo: you are calling a subroutine.
As far as I remember, AVR has a CALL instruction to invoke subroutines; the name of the subroutine is a label, and the program jumps to the address of that label and continues executing from there.
I think you can clarify your doubts by reading some AVR assembly tutorials.
It is fun (at least for me) to see what exactly the CPU does when it calls a function that I wrote, but it requires knowing what the instructions do. You develop for AVR, so there is an instruction set that you can read about in this PDF and compare with your assembly files.

Return value of (void*) 57600 in C

I am reading the source code of a UART peripheral and there is a function as below:
eResult = adi_stdio_ControlDevice (hSTDIOUART,
ADI_STDIO_COMMAND_SET_UART_BAUD_RATE, (void *)57600);
This function is used to connect the UART, and the number 57600 is the baud rate. What I do not understand is the meaning of (void*)57600.
I think this may be a pointer to const and the return value of (void*)57600 is 57600. When we use (void*)57600, does it mean we are creating a pointer that points to the value 57600?
And why must we use (void*)57600?
Not quite. The "return value" (quoted because it's not actually being returned from a function, instead it's the result of a cast) of (void *)57600 is simply the value 57600 being treated as (or, in other words, cast to) a void pointer.
And, while you are actually converting 57600 to a void pointer, it's almost certainly not being used as a pointer. More likely is that the prototype for adi_stdio_ControlDevice has a generic argument (one that can be used for many things).
Device control functions are particularly apt to do that since they are meant to be generic across a large variety of devices, so you may have to give a wide variety of types to the calls.
You'll probably find that, for the command to set the baud rate, it simply gets cast back to an integral value at the other end before being used, something like:
static int localSpeed;
static char *localString;
static double localPi;
static struct rational { int numerator; int denominator; } localStruct;
bool adi_stdio_ControlDevice (HANDLE hndl, COMMAND cmd, void *generic) {
switch (cmd) {
case ADI_STDIO_COMMAND_SET_UART_BAUD_RATE: {
localSpeed = (int)(intptr_t)generic; /* the pointer's bits are the value itself */
break;
}
case ADI_COMMAND_WITH_STRING_ARG: {
if (localString) free(localString);
localString = strdup((char*)generic);
break;
}
case ADI_COMMAND_WITH_DOUBLE_PTR_ARG: {
localPi = *((double*)generic);
break;
}
case ADI_COMMAND_WITH_STRUCT_PTR: {
memcpy(&localStruct, generic, sizeof(localStruct));
break;
}
}
return true;
}
Other commands (such as the fake ones I've added) would be able to use the generic argument in a variety of ways, as integers or other pointer types for example.
This is actually supported by the documentation (VisualDSP++ 5.0 Device Drivers and System Services Manual for Blackfin® Processors) for that call, which states:
ADI_STDIO_RESULT adi_stdio_ControlDevice (
ADI_STDIO_DEVICE_HANDLE hStdioDevice,
uint32_t nCommandID,
void *const pValue
);
: : :
pValue: Argument required for executing the command. Depending upon the command, different types of arguments are required.
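As a self-contained illustration of the same idea, here is a minimal sketch of a control function whose generic void* argument carries either a plain number or a real pointer, depending on the command. Device, Command, SET_BAUD_RATE, SET_NAME and controlDevice are hypothetical names for this sketch and are not part of the ADI API. The integer-to-pointer round trip is implementation-defined, although it behaves as expected on common platforms.

```cpp
#include <cstdint>

// Hypothetical command IDs, not the real ADI ones.
enum Command { SET_BAUD_RATE, SET_NAME };

// Hypothetical device state that the commands modify.
struct Device {
    std::intptr_t baud = 0;
    const char *name = "";
};

// One entry point with one generic argument; the command decides its meaning.
bool controlDevice(Device &dev, Command cmd, void *value) {
    switch (cmd) {
    case SET_BAUD_RATE:
        // The pointer is never dereferenced: its bits are the number itself.
        dev.baud = reinterpret_cast<std::intptr_t>(value);
        return true;
    case SET_NAME:
        // Here the argument really is a pointer to data owned by the caller.
        dev.name = static_cast<const char *>(value);
        return true;
    }
    return false;
}
```

A call such as controlDevice(dev, SET_BAUD_RATE, reinterpret_cast<void *>(static_cast<std::intptr_t>(57600))) mirrors the (void *)57600 from the question: nothing ever points at a stored 57600; the number simply travels inside the pointer-sized argument.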

CUDA streams, texture binding and async memcpy

Writing some signal processing in CUDA I recently made huge progress in optimizing it. By using 1D textures and adjusting my access patterns I managed to get a 10× performance boost. (I previously tried transaction aligned prefetching from global into shared memory, but the nonuniform access patterns happening later messed up the warp→shared cache bank association (I think)).
So now I'm facing the question of how CUDA textures and bindings interact with asynchronous memcpy.
Consider the following kernel
texture<...> mytexture;
__global__ void mykernel(float *pOut)
{
pOut[threadIdx.x] = tex1Dfetch(mytexture, threadIdx.x);
}
The kernel is launched in multiple streams
extern void *sourcedata;
#define N_CUDA_STREAMS ...
cudaStream_t stream[N_CUDA_STREAMS];
void *d_pOut[N_CUDA_STREAMS];
void *d_texData[N_CUDA_STREAMS];
for(int k_stream = 0; k_stream < N_CUDA_STREAMS; k_stream++) {
cudaStreamCreate(&stream[k_stream]);
cudaMalloc(&d_pOut[k_stream], ...);
cudaMalloc(&d_texData[k_stream], ...);
}
/* ... */
for(int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
int const k_stream = i_datablock % N_CUDA_STREAMS;
cudaMemcpyAsync(d_texData[k_stream], (char*)sourcedata + i_datablock * blocksize, ..., stream[k_stream]);
cudaBindTexture(0, &mytexture, d_texData[k_stream], ...);
mykernel<<<..., stream[k_stream]>>>((float*)d_pOut[k_stream]);
}
Now what I wonder about is, since there is only one texture reference, what happens when I bind a buffer to a texture while other streams' kernels access that texture? cudaBindTexture doesn't take a stream parameter, so I'm worried that by binding the texture to another device pointer while running kernels are asynchronously accessing said texture, I'll divert their accesses to the other data.
The CUDA documentation doesn't say anything about this. If I have to disentangle this to allow concurrent access, it seems I'd have to create a number of texture references and use a switch statement to choose between them, based on the stream number passed as a kernel launch parameter.
Unfortunately, CUDA doesn't allow arrays of textures on the device side, i.e. the following does not work:
texture<...> texarray[N_CUDA_STREAMS];
Layered textures are not an option, because the amount of data I have only fits within a plain 1D texture not bound to a CUDA array (see table F-2 in the CUDA 4.2 C Programming Guide).
Indeed you cannot unbind the texture while still using it in a different stream.
Since the number of streams doesn't need to be large to hide the asynchronous memcpys (2 would already do), you could use C++ templates to give each stream its own texture:
texture<float, 1, cudaReadModeElementType> mytexture1;
texture<float, 1, cudaReadModeElementType> mytexture2;
template<int TexSel> __device__ float myTex1Dfetch(int x);
template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }
template<int TexSel> __global__ void mykernel(float *pOut)
{
pOut[threadIdx.x] = myTex1Dfetch<TexSel>(threadIdx.x);
}
int main(void)
{
float *out_d[2];
// ...
mykernel<1><<<blocks, threads, 0, stream[0]>>>(out_d[0]);
mykernel<2><<<blocks, threads, 0, stream[1]>>>(out_d[1]);
// ...
}
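Inside a loop where the stream index is only known at run time, the template argument has to be chosen through a switch. The following host-only sketch shows that dispatch in plain C++. launchKernelFor is a hypothetical stand-in for the mykernel<TexSel><<<...>>> launch and simply returns its selector so the choice is observable; launchOnStream is likewise a made-up name.

```cpp
// Host-side stand-in for launching mykernel<TexSel>; a real version would
// perform the kernel launch on the matching stream instead of returning.
template <int TexSel>
int launchKernelFor() {
    return TexSel;
}

// Maps a runtime stream index (0-based) to the compile-time texture
// selector (1-based), one case per declared texture reference.
int launchOnStream(int k_stream) {
    switch (k_stream) {
    case 0: return launchKernelFor<1>();
    case 1: return launchKernelFor<2>();
    default: return -1;  // no texture reference declared for this stream
    }
}
```

With two streams this stays small; each additional stream needs one more texture reference, one more specialization and one more case.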
