CUDA/C++: Passing __device__ pointers in C++ code - pointers

I am developing a Windows 64-bit application that will manage concurrent execution of different CUDA-algorithms on several GPUs.
My design requires a way of passing pointers to device memory around C++ code (e.g. remembering them as members in my C++ objects).
I know that it is impossible to declare class members with __device__ qualifiers.
However, I couldn't find a definite answer as to whether assigning a __device__ pointer to a normal C pointer and then using the latter works. In other words: is the following code valid?
__device__ float *ptr;
cudaMalloc(&ptr, size);
float *ptr2 = ptr;
some_kernel<<<1,1>>>(ptr2);
For me it compiled and behaved correctly but I would like to know whether it is guaranteed to be correct.

No, that code isn't strictly valid. While it might work on the host side (more or less by accident), if you tried to dereference ptr directly from device code, you would find it would have an invalid value.
The correct way to do what your code implies would be like this:
__device__ float *ptr;
__global__ void some_kernel()
{
float val = ptr[threadIdx.x];
....
}
float *ptr2;
cudaMalloc(&ptr2, size);
cudaMemcpyToSymbol("ptr", &ptr2, sizeof(float *));
some_kernel<<<1,1>>>();
For CUDA 4.x or newer, change the cudaMemcpyToSymbol call to:
cudaMemcpyToSymbol(ptr, &ptr2, sizeof(float *));
If the static device symbol ptr is really superfluous, you can just do something like this:
float *ptr2;
cudaMalloc(&ptr2, size);
some_kernel<<<1,1>>>(ptr2);
But I suspect that what you are probably looking for is something like the thrust library device_ptr class, which is a nice abstraction wrapping the naked device pointer and makes it absolutely clear in code what is in device memory and what is in host memory.
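To illustrate the idea of a typed wrapper around a naked device pointer, here is a minimal host-only sketch. DevicePtr and Algorithm are hypothetical names invented for this example; this is not Thrust's device_ptr API, and the sketch makes no CUDA calls. In real code the raw address would come from cudaMalloc and get() would be passed as a kernel argument.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tag type: marks a raw address as living in device memory.
// It deliberately has no operator* or operator[], so host code cannot
// dereference it by accident.
template <typename T>
class DevicePtr {
public:
    DevicePtr() = default;
    explicit DevicePtr(T* raw) : raw_(raw) {}

    // Hand the raw address back, e.g. to pass it as a kernel argument.
    T* get() const { return raw_; }
    explicit operator bool() const { return raw_ != nullptr; }

    // Pointer arithmetic stays inside the wrapper type.
    DevicePtr operator+(std::ptrdiff_t n) const { return DevicePtr(raw_ + n); }

private:
    T* raw_ = nullptr;
};

// An ordinary C++ class can keep the handle as a normal member.
class Algorithm {
public:
    void set_buffer(DevicePtr<float> buf, std::size_t count) {
        buffer_ = buf;
        count_ = count;
    }
    DevicePtr<float> buffer() const { return buffer_; }
    std::size_t count() const { return count_; }

private:
    DevicePtr<float> buffer_;
    std::size_t count_ = 0;
};
```

Because the constructor is explicit and there is no dereference operator, a plain float* cannot silently become a DevicePtr<float>, and host code cannot read through it by mistake.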

Related

Is there a way to specify the base address of a shared library using dlopen()?

It seems that when we dlopen() some libraries, they will be loaded into some preferred (but not fixed) addresses. I've checked the source code of dlopen(), and a core function says
static __always_inline const char *
_dl_map_segments (struct link_map *l, int fd,
const ElfW(Ehdr) *header, int type,
const struct loadcmd loadcmds[], size_t nloadcmds,
const size_t maplength, bool has_holes,
struct link_map *loader)
{
const struct loadcmd *c = loadcmds;
if (__glibc_likely (type == ET_DYN))
{
/* This is a position-independent shared object. We can let the
kernel map it anywhere it likes, but we must have space for all
the segments in their specified positions relative to the first.
So we map the first segment without MAP_FIXED, but with its
extent increased to cover all the segments. Then we remove
access from excess portion, and there is known sufficient space
there to remap from the later segments.
As a refinement, sometimes we have an address that we would
prefer to map such objects at; but this is only a preference,
the OS can do whatever it likes. */
ElfW(Addr) mappref
= (ELF_PREFERRED_ADDRESS (loader, maplength,
c->mapstart & GLRO(dl_use_load_bias))
- MAP_BASE_ADDR (l));
/* Remember which part of the address space this object uses. */
l->l_map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplength,
c->prot,
MAP_COPY|MAP_FILE,
fd, c->mapoff);
if (__glibc_unlikely ((void *) l->l_map_start == MAP_FAILED))
return DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT;
...
}
The comment says you can specify a preferred address, but OS will determine whether to use it.
Question
Is there any way we can specify the base address for each dlopened module?
ELF_PREFERRED_ADDRESS is set to 0 by default, but this macro seems to imply that the preferred address can be changed, say by an environment variable? But even if there is one, I doubt that it can be changed for each dlopened library.
If I want to implement this feature myself, it seems that I need to wrap a new dlopen function and pass the preferred address to the above core function (and use MAP_FIXED maybe). Is it correct?
Thanks!
"Is there any way we can specify the base address for each dlopened module?"
No.
"ELF_PREFERRED_ADDRESS is set to 0 by default, but this macro seems to imply that the preferred address can be changed, say by an environment variable? But even if there is one, I doubt that it can be changed for each dlopened library."
This code is compiled into the dynamic loader ld-linux.so and cannot be changed after compilation.
"If I want to implement this feature myself, it seems that I need to wrap a new dlopen function and pass the preferred address to the above core function (and use MAP_FIXED maybe). Is it correct?"
The function is private to ld-linux. You will not be able to wrap it, or call it from outside of ld-linux.
P.S. What you are likely looking for is the prelink command.
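The "preference only" behaviour described in the glibc comment comes from mmap itself: without MAP_FIXED, the address argument is just a hint. The following sketch (Linux/POSIX only) shows this with an anonymous mapping. The helper names map_with_hint and unmap are made up for this example, and this is not how the dynamic loader is meant to be driven.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Ask the kernel for an anonymous mapping near `hint` WITHOUT MAP_FIXED.
// Returns the address actually chosen, or 0 on failure.
std::uintptr_t map_with_hint(std::uintptr_t hint, std::size_t length) {
    void* p = mmap(reinterpret_cast<void*>(hint), length,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 0;
    return reinterpret_cast<std::uintptr_t>(p);
}

// Release a mapping obtained from map_with_hint().
bool unmap(std::uintptr_t addr, std::size_t length) {
    return munmap(reinterpret_cast<void*>(addr), length) == 0;
}
```

The returned address may or may not equal the hint. The caller has to use whatever the kernel picked, which is the situation the loader is in when it maps the first segment.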

Pointer to a register on a 16 bit controller

How do you declare a pointer on a 16 bit Renesas RL78 microcontroller using IAR's EWB RL78 compiler to a register which has a 20 bit address?
Ex:
static int *ptr = (int *)0xF1000;
The above does not work because pointers are 16 bit addresses.
If the register in question is an on-chip peripheral, then it is likely that your toolchain already includes a processor header with all registers declared, in which case you should use that. If for some reason you cannot or do not wish to do that, then you could at least look at that to see how it declares such registers.
In any event you should at least declare the pointed-to object volatile, since it is not a regular memory location and may change beyond the control and knowledge of your code as part of the normal peripheral behaviour. Moreover, you should use explicitly sized data types, and it is unlikely that this register is signed.
#include <stdint.h>
...
static volatile uint16_t* ptr = (volatile uint16_t*)0xF1000u ;
Added following clarification of target architecture:
The IAR RL78 compiler supports two data models - near and far. From the IAR compiler manual:
● The Near data model can access data in the highest 64 Kbytes of data memory
● The Far data model can address data in the entire 1 Mbytes of data memory.
The near model is the default. The far model may be set using the compiler option: --data_model=far; this will globally change the pointer type to allow 20 bit addressing (pointers are 3 bytes long in this case).
Even without specifying the data model globally it is possible to override the default pointer type by explicitly specifying the pointer type using the keywords __near and __far. So in the example in the question the correct declaration would be:
static volatile uint16_t __far* ptr = (volatile uint16_t __far*)0xF1000u ;
Note the position of the __far keyword is critical. Its position can be used to declare a pointer to far memory, or a pointer in far memory (or you can even declare both to and in far memory).
On an RL78, 0xF1000 in fact refers to the start of data flash rather than a register as stated in the question. Typically a pointer to a register would not be subject to alteration (which would mean it referred to a different register), so it might reasonably be declared const:
static volatile uint16_t __far* const ptr = (volatile uint16_t __far*)0xF1000u ;
Similarly to __far the position of const is critical to the semantics. The above prevents ptr from being modified but allows what ptr refers to to be modified. Being flash memory, this may not always be desirable or possible, so it is possible that it could reasonably be declared a const pointer to a const value.
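The effect of where const is placed can be checked on a host compiler with standard C++ alone. The sketch below leaves out the IAR-specific __far keyword and uses an ordinary variable, fake_register, as a stand-in for the hardware location. All names in it are assumptions made for illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Stand-in for a hardware register so the sketch runs on a host machine.
static volatile std::uint16_t fake_register = 0;

// 1) const pointer to writable volatile data: the pointer is fixed,
//    the register contents can be written.
static volatile std::uint16_t* const reg = &fake_register;

// 2) const pointer to const volatile data: a read-only view.
static const volatile std::uint16_t* const reg_ro = &fake_register;

static_assert(std::is_const<decltype(reg)>::value,
              "the pointer itself is const");
static_assert(!std::is_const<std::remove_pointer<decltype(reg)>::type>::value,
              "the pointee is writable through reg");
static_assert(std::is_const<std::remove_pointer<decltype(reg_ro)>::type>::value,
              "the pointee is read-only through reg_ro");

std::uint16_t write_then_read(std::uint16_t v) {
    *reg = v;            // allowed: pointee is not const
    // reg = nullptr;    // would not compile: pointer is const
    // *reg_ro = v;      // would not compile: pointee is const
    return *reg_ro;      // reading through the read-only view is fine
}
```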
Note that for RL78 Special Function Registers (SFR) the IAR compiler has a keyword __sfr specifically for addressing registers in the area 0xFFF00-0xFFFFF:
Example:
#pragma location=0xFFF20
__no_init volatile uint8_t __sfr PORT1; // PORT1 is located at address 0xFFF20
Alternative syntax using the IAR-specific compiler extension:
__no_init volatile uint8_t __sfr PORT1 @ 0xFFF20 ;

Specifying multiple address spaces for a type in the list of arguments of function

I wrote a function in OpenCL:
void sort(int* array, int size)
{
}
and I need to call the function once on a __private array and once on a __global array. Apparently, OpenCL does not allow specifying multiple address spaces for a type. Therefore, I have to duplicate the function, even though both copies have exactly the same body:
void sort_g(__global int* array, int size)
{
}
void sort_p(__private int* array, int size)
{
}
This makes the code much harder to maintain, and I am wondering whether there is a better way to handle multiple address spaces in OpenCL.
P.S.: I don't see why OpenCL doesn't allow multiple address spaces for a type. Compiler could generate multiple instances of the function (one per address space) and use them appropriately once they're called in the kernel.
For OpenCL < 2.0, this is how the language is designed and there is no getting around it, regrettably.
For OpenCL >= 2.0, with the introduction of the generic address space, your first piece of code works as you would expect.
In short, upgrading to 2.0 would solve your problem (and bring in other niceties), otherwise you're out of luck (you could perhaps wrap your function in a macro, but ew, macros).

What's the point of unique_ptr?

Isn't a unique_ptr essentially the same as a direct instance of the object? I mean, there are a few differences with dynamic inheritance, and performance, but is that all unique_ptr does?
Consider this code to see what I mean. Isn't this:
#include <iostream>
#include <memory>
using namespace std;
void print(int a) {
cout << a << "\n";
}
int main()
{
unique_ptr<int> a(new int(42)); // initialized, so that reading *a is well defined
print(*a);
return 0;
}
Almost exactly the same as this:
#include <iostream>
#include <memory>
using namespace std;
void print(int a) {
cout << a << "\n";
}
int main()
{
int a = 42; // initialized, so that reading a is well defined
print(a);
return 0;
}
Or am I misunderstanding what unique_ptr should be used for?
In addition to the cases mentioned by Chris Pitman, another case where you will want std::unique_ptr is when you instantiate sufficiently large objects: it then makes sense to allocate them on the heap rather than on the stack. The stack size is limited, and sooner or later you might run into a stack overflow. That is where std::unique_ptr is useful.
The purpose of std::unique_ptr is to provide automatic and exception-safe deallocation of dynamically allocated memory (unlike a raw pointer that must be explicitly deleted in order to be freed and that is easy to inadvertently not get freed in the case of interleaved exceptions).
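A small sketch of the exception-safety point: the object below is released even though the function exits through an exception. Tracked and the two functions are hypothetical names used only for this illustration.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical resource that counts how many instances are alive.
struct Tracked {
    static int alive;
    Tracked() { ++alive; }
    ~Tracked() { --alive; }
};
int Tracked::alive = 0;

// Allocates, then throws before any manual cleanup code could run.
void work_with_unique_ptr() {
    std::unique_ptr<Tracked> t(new Tracked);
    throw std::runtime_error("something went wrong");
    // No delete needed: ~unique_ptr runs during stack unwinding.
}

// Returns the number of Tracked objects still alive after the exception.
int leaked_after_exception() {
    try {
        work_with_unique_ptr();
    } catch (const std::runtime_error&) {
        // Swallowed on purpose: only the cleanup matters here.
    }
    return Tracked::alive;
}
```

With a raw pointer in place of the unique_ptr, the same function would leave one Tracked object alive unless a try/catch with an explicit delete were added.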
Your question, though, is more about the value of pointers in general than about std::unique_ptr specifically. For simple builtin types like int, there generally is very little reason to use a pointer rather than simply passing or storing the object by value. However, there are three cases where pointers are necessary or useful:
Representing a separate "not set" or "invalid" value.
Allowing modification.
Allowing for different polymorphic runtime types.
Invalid or not set
A pointer supports an additional nullptr value indicating that the pointer has not been set. For example, suppose you want to support all values of a given type (e.g. the entire range of integers) but also represent the notion that the user never entered a value in the interface. That would be a case for using a std::unique_ptr<int>: you can check whether the pointer is null to tell whether the value was set, without having to sacrifice a valid integer value as an invalid "sentinel" that denotes "not set".
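As a rough sketch of the "not set" case, the following uses a null std::unique_ptr<int> to mean that no value was entered. AgeField and describe are hypothetical names chosen for this example.

```cpp
#include <memory>
#include <string>

// Hypothetical form field: null means "the user never entered anything".
struct AgeField {
    std::unique_ptr<int> value;   // starts out null

    void set(int v) { value.reset(new int(v)); }
    void clear() { value.reset(); }
    bool is_set() const { return value != nullptr; }
};

// Every int, including 0 and negative values, remains a valid entry.
std::string describe(const AgeField& f) {
    if (!f.is_set()) return "not set";
    return "age = " + std::to_string(*f.value);
}
```

In C++17 and later, std::optional<int> expresses the same idea without a heap allocation.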
Allowing modification
This can also be accomplished with references rather than pointers, but pointers are one way of doing this. If you use a regular value, then you are dealing with a copy of the original, and any modifications only affect that copy. If you use a pointer or a reference, your modifications are visible to the owner of the original instance. With a unique pointer, you additionally know that no other smart pointer owns the object, which makes it easier to reason about who may modify it. Raw pointers or references obtained from it can still exist, so this alone does not make unsynchronized access from several threads safe.
Polymorphic types
This can likewise be done with references, not just with pointers, but there are cases where due to semantics of ownership or allocation, you would want to use a pointer to do this... When it comes to user-defined types, it is possible to create a hierarchical "inheritance" relationship. If you want your code to operate on all variations of a given type, then you would need to use a pointer or reference to the base type. A common reason to use std::unique_ptr<> for something like this would be if the object is constructed through a factory where the class you are defining maintains ownership of the constructed object. For example:
class Airline {
public:
Airline(const AirplaneFactory& factory);
// ...
private:
// ...
void AddAirplaneToInventory();
// Can create many different type of airplanes, such as
// a Boeing747 or an Airbus320
const AirplaneFactory& airplane_factory_;
std::vector<std::unique_ptr<Airplane>> airplanes_;
};
// ...
void Airline::AddAirplaneToInventory() {
airplanes_.push_back(airplane_factory_.Create());
}
As you mentioned, virtual classes are one use case. Beyond that, here are two others:
Optional instances of objects. My class may delay instantiating an instance of the object. To do so, I need to use memory allocation but still want the benefits of RAII.
Integrating with C libraries or other libraries that love returning naked pointers. For example, OpenSSL returns pointers from many (poorly documented) functions, some of which you need to clean up. Having a non-copyable pointer container is perfect for this case, since I can protect the pointer as soon as it is returned.
A unique_ptr functions the same as a normal pointer except that you do not have to remember to free it (in fact it is simply a wrapper around a pointer). After you allocate the memory, you do not have to afterwards call delete on the pointer since the destructor on unique_ptr takes care of this for you.
Two things come to my mind:
You can use it as a generic exception-safe RAII wrapper. Any resource that has a "close" function can be wrapped with unique_ptr easily by using a custom deleter.
There are also times you might have to move a pointer around without knowing its lifetime explicitly. If the only constraint you know is uniqueness, then unique_ptr is an easy solution. You could almost always do manual memory management also in that case, but it is not automatically exception safe and you could forget to delete. Or the position you have to delete in your code could change. The unique_ptr solution could easily be more maintainable.
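The custom-deleter use mentioned above could look roughly like this for a C FILE* handle. FileCloser, FilePtr, write_line and the g_files_closed counter are assumptions made for this sketch. The counter exists only to make the cleanup observable.

```cpp
#include <cstdio>
#include <memory>

// Counts how many files the deleter has closed (for demonstration only).
static int g_files_closed = 0;

// Custom deleter: unique_ptr calls this instead of `delete`.
struct FileCloser {
    void operator()(std::FILE* f) const {
        if (f) {
            std::fclose(f);
            ++g_files_closed;
        }
    }
};

using FilePtr = std::unique_ptr<std::FILE, FileCloser>;

// Writes a line to a temporary file; the file is closed on every exit path.
bool write_line(const char* text) {
    FilePtr file(std::tmpfile());
    if (!file) return false;                    // nothing was opened
    return std::fputs(text, file.get()) >= 0;
}   // FileCloser runs here when `file` holds a non-null pointer
```

The same pattern fits any resource with a matching "close" function: the deleter type names the cleanup, and ownership can only be moved, never copied.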

How do game trainers change an address in memory that's dynamic?

Let's assume I am a game and I have a global int* that points to my health. A game trainer's job is to modify this value in order to achieve god mode. I've looked up tutorials on game trainers to understand how they work, and the general idea is to use a memory scanner to try to find the address of a certain value, and then modify the value at this address, for example by injecting a DLL.
But I made a simple program with a global int* and its address changes every time I run the app, so I don't get how game trainers can hard code these addresses? Or is my example wrong?
What am I missing?
The way this is usually done is by tracing the pointer chain from a static variable up to the heap address containing the variable in question. For example:
#include <vector>

void die(); // defined elsewhere

struct CharacterStats
{
    int health;
    // ...
};
class Character
{
public:
    CharacterStats* stats;
    // ...
    void hit(int damage)
    {
        stats->health -= damage;
        if (stats->health <= 0)
            die();
    }
};
class Game
{
public:
    Character* main_character;
    std::vector<Character*> enemies;
    // ...
};
Game* game; // static variable: the start of the pointer chain
int main()
{
    game = new Game();
    game->main_character = new Character();
    game->main_character->stats = new CharacterStats;
    // ...
}
In this case, if you follow mikek3332002's advice and set a breakpoint inside the Character::hit() function and nop out the subtraction, it would cause all characters, including enemies, to be invulnerable. The solution is to find the address of the "game" variable (which should reside in the data segment or a function's stack), and follow all the pointers until you find the address of the health variable.
Some tools, e.g. Cheat Engine, have functionality to automate this, and attempt to find the pointer chain by themselves. You will probably have to resort to reverse-engineering for more complicated cases, though.
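Following such a pointer chain can be sketched as below. This is an in-process illustration only: read_pointer reads the program's own memory, whereas a real trainer would replace it with a cross-process read (for example ReadProcessMemory on Windows). The structs and function names are stand-ins invented for this example, and the offsets would normally come from a memory scanner or from reverse engineering.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in types with the same shape as the example above.
struct CharacterStats { int armor; int health; };
struct Character { int id; CharacterStats* stats; };
struct Game { int level; Character* main_character; };

// Reads a pointer-sized value stored at `address` (own process only).
static std::uintptr_t read_pointer(std::uintptr_t address) {
    std::uintptr_t value;
    std::memcpy(&value, reinterpret_cast<const void*>(address), sizeof value);
    return value;
}

// Starting at a static address that holds a pointer, repeatedly
// dereference and add the next offset. The result is the address of
// the final field; it is not dereferenced.
std::uintptr_t resolve_chain(std::uintptr_t static_address,
                             const std::vector<std::size_t>& offsets) {
    std::uintptr_t address = static_address;
    for (std::size_t offset : offsets) {
        address = read_pointer(address) + offset;
    }
    return address;
}
```

For the example types, the chain would start at the address of the static game pointer and use the offsets of main_character, stats and health in turn. The heap addresses change on every run, but the static address (relative to the module base) and the offsets stay the same for a given build.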
Discovery of the access pointers is quite cumbersome and static memory values are difficult to adapt to different compilers or game versions.
With API hooking of malloc(), free(), etc. there is a different method than following pointers. Discovery starts with recording all dynamic memory allocations and doing memory search in parallel. The found heap memory address is then reverse matched against the recorded memory allocations. You get to know the size of the object and the offset of your value within the object. You repeat this with backtracing and get the jump-back code address of a malloc() call or a C++ constructor. With that information you can track and modify all objects which get allocated from there. You dump the objects and compare them and find a lot more interesting values. E.g. the universal elite game trainer "ugtrain" does it like this on Linux. It uses LD_PRELOAD.
Adaptation to another game version works by disassembling the binary with "objdump -D" and searching for the library function call with the known memory size in it.
See: http://en.wikipedia.org/wiki/Trainer_%28games%29
Ugtrain source: https://github.com/sriemer/ugtrain
The malloc() hook looks like this:
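#define _GNU_SOURCE /* for RTLD_NEXT */
#include <dlfcn.h>
#include <stdbool.h>
#include <stddef.h>
/* postprocess_malloc() is part of ugtrain and is not shown here */
static __thread bool no_hook = false;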
void *malloc (size_t size)
{
void *mem_addr;
static void *(*orig_malloc)(size_t size) = NULL;
/* handle malloc() recursion correctly */
if (no_hook)
return orig_malloc(size);
/* get the libc malloc function */
no_hook = true;
if (!orig_malloc)
*(void **) (&orig_malloc) = dlsym(RTLD_NEXT, "malloc");
mem_addr = orig_malloc(size);
/* real magic -> backtrace and send out spied information */
postprocess_malloc(size, mem_addr);
no_hook = false;
return mem_addr;
}
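The reverse-matching step described above (matching a found heap address against the recorded allocations) can be sketched as follows. AllocRecord, Match and match_address are hypothetical names for this example and are not taken from ugtrain's source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One recorded malloc() call, as a hook could log it.
struct AllocRecord {
    std::uintptr_t start;   // address returned by malloc()
    std::size_t size;       // requested size in bytes
};

// Result of matching a scanned address against the log.
struct Match {
    bool found;
    std::size_t object_size;   // size of the containing allocation
    std::size_t offset;        // offset of the value inside the object
};

// Finds the allocation that contains `address`, if any.
Match match_address(const std::vector<AllocRecord>& records,
                    std::uintptr_t address) {
    for (const AllocRecord& rec : records) {
        if (address >= rec.start && address - rec.start < rec.size) {
            return Match{true, rec.size,
                         static_cast<std::size_t>(address - rec.start)};
        }
    }
    return Match{false, 0, 0};
}
```

The object size and offset found this way do not depend on where the heap happens to place the object in a given run, which is what makes them useful across restarts.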
But if the found memory address is located within the executable or a library in memory, then ASLR is likely the reason the address changes between runs. On Linux, libraries are PIC (position-independent code), and on recent distributions executables are typically built as PIE (position-independent executables).
EDIT: Never mind, it seems it was just good luck. However, the last three digits of the pointer seem to stay the same. Perhaps this is ASLR kicking in and changing the image base address?
My mistake: I was using %d in printf to print the address instead of %p. After switching to %p, the address stayed the same.
#include <stdio.h>
int *something = NULL;
int main()
{
something = new int;
*something = 5;
fprintf(stdout, "Address of something: %p\nValue of something: %d\nPointer Address of something: %p", (void *)&something, *something, (void *)something);
getchar();
return 0;
}
Example for a dynamically allocated variable
The value I want to find is the number of lives, so that I can stop my lives from being reduced to 0 and getting a game over.
Play the game and search for the location of the lives variable in this instance.
Once found, use a disassembler/debugger to watch that location for changes.
Lose a life.
The debugger should have reported the address of the instruction where the decrement occurred.
Replace that instruction with no-ops.
Got this pattern from the program called tsearch.
A few related websites found from researching this topic:
http://deviatedhacking.com/index.php?/topic/75-dynamic-memory-allocation/
http://www.edgeofnowhere.cc/viewforum.php?f=183
http://www.oldschoolhack.de/tutorials/Theories%20and%20methods%20of%20code-caves.htm
http://webcache.googleusercontent.com/search?q=cache:4wzMzFIZx54J:gamehacking.com/forums/tutorials-beginners/11597-c-making-game-trainer.html+reading+a+dynamic+memory+address+game+trainer&cd=2&hl=en&ct=clnk&gl=au&client=firefox-a (A google cache version)
http://www.codeproject.com/KB/cpp/codecave.aspx
The way things like Gameshark codes were figured out was by dumping the memory image of the application, then doing one thing, then looking to see what changed. There might be a few things changing, but there should be patterns to look for. E.g. dump memory, shoot, dump memory, shoot again, dump memory, reload. Then look for changes and get an idea of where and how ammo is stored. For health it'll be similar, but a lot more things will be changing (since you'll be moving at the very least). It is easiest to do this while minimizing the "external effects": don't try to diff memory dumps during a firefight, because a lot is happening. Do your diffs while standing in lava, or falling off a building, or something of that nature.
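The dump-and-compare idea can be sketched as a filter over three snapshots, keeping only the positions whose value went down after each shot. Treating the dumps as arrays of 32-bit integers is an assumption of this sketch, as is the function name. Real dumps would also need handling for alignment and for other value sizes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices whose value strictly decreased from `before` to
// `after_first` and again from `after_first` to `after_second`.
std::vector<std::size_t> find_decreasing(
        const std::vector<std::int32_t>& before,
        const std::vector<std::int32_t>& after_first,
        const std::vector<std::int32_t>& after_second) {
    std::vector<std::size_t> hits;
    const std::size_t n = std::min({before.size(),
                                    after_first.size(),
                                    after_second.size()});
    for (std::size_t i = 0; i < n; ++i) {
        if (before[i] > after_first[i] && after_first[i] > after_second[i]) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

Each further snapshot narrows the candidate list, which is why repeating a single, isolated action works better than comparing dumps taken while many things are changing.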

Resources