I'm using an Intel integrated GPU for my OpenCL implementation. I am implementing a zero-copy program: instead of copying the data to the GPU, the host and device share common memory (RAM).
I have a 64-bit CPU, but the GPU specs show it has only a 32-bit addressing mode.
I'm sharing a malloc'd heap buffer between the GPU and CPU, and when I print its address I see the following.
In the GPU kernel:

    if (id == 0) {
        printf("Mem address: %p\n", A);
        // Outputs: Mem address: 0x1010000
    }

On the CPU, it prints:

    printf("Outside Mem address: %p\n", cpuA);
    // Outputs: Outside Mem address: 0x7fcd529d9000

Device: Intel(R) HD Graphics IvyBridge M GT2
I don't understand how the buffer is mapped in the GPU, and I wonder whether 2^28/2^32 is the maximum address the GPU can access.
The memory address you are printing on the host is a virtual address that only makes sense in the context of your program's process. In the CPU, this is transparently translated to a physical RAM page, the address of which is unrelated to the virtual address but stored in a lookup table (page table) maintained by the operating system. Note that "64-bit CPU" typically refers to the number of bits in a virtual address. (Although many 64-bit CPUs actually ignore 8-16 bits.) The number of bits for physical addresses (for addressing physical RAM cells and mapped device memory) is often a lot less, as little as 40 bits.
Devices attached to the system that are able to perform direct memory access (DMA) most commonly deal with physical memory addresses. If your Intel GPU does not have an internal memory mapping scheme (and there is no IOMMU active, see below), then the address you are seeing in your OpenCL kernel code is probably a physical memory address. If the device can only address 32 bits, it can only access the first 4GiB of physical memory in your system. In that case the operating system arranges for any buffer used by the restricted device to end up below the 4GiB boundary, regardless of its virtual address: either by steering allocations (memory above 4GiB goes to devices and user-space processes that aren't affected by the 32-bit restriction) or by using "bounce buffers".
Recently, IOMMUs have become common. These introduce a virtual memory like mapping system for devices as well - so the memory addresses a device sees are again unrelated to the physical addresses of system memory they correspond to. This is primarily a security feature - ideally, each device gets its own address space, so devices can't accidentally or deliberately access system memory they should not be accessing. It also means that the 32-bit limitation becomes completely irrelevant, as each device gets its own 32-bit address space which can be mapped to physical memory beyond the 4GiB boundary.
I am trying to learn basic programming with Cheat Engine and games.
So far, I still can't grasp pointers, particularly how to trace them.
Most of the tutorials on pointers work with 4-byte-long addresses, but what I have is a 6-byte-long address, and so far I have failed to track down the base address from it.
As shown in the screenshot, R9 is the offset and RCX should lead back to the pointer. R9 stays the same, while RCX changes each time the game restarts. Where should I go from here?
A 32-bit address space uses 32 bits (4 bytes) for memory addressing, while a 64-bit address space uses 64 bits (8 bytes).
In practice, 64 bits is much, much more than required (larger than the estimated storage size of the entire internet), so current systems use only 48 bits (6 bytes) to address their memory.
Since most programming languages and computers in general only support 32-bit and 64-bit values (not 48-bit), the 48-bit address is stored in a 64-bit variable/register, with the most significant bytes being zero (0x0000).
Therefore, in order to scan for the pointer value, you have to scan for an 8-byte value (with "Hex" ticked, as CE shows address values as hex by default).
I was always under the impression that a NIC has a unique MAC address, and that if an incoming packet matches that MAC, it lifts the packet and sends it to the kernel.
Recently I installed VirtualBox (host: Ubuntu, guest OS: Ubuntu) and configured the network in "bridged adapter" mode (the MAC is randomly chosen). The VM acts like an independent machine: the guest OS has its own MAC address and public IP.
I have observed that packets sent on the wire from the VM carry the virtual MAC, and the same goes for incoming packets.
1) Do NICs allow sending network packets with a MAC different from the physical MAC? And likewise for incoming packets, is it OK to lift packets whose MAC does not match the physical MAC? (As I understand it, this is only possible in promiscuous mode.)
2) Isn't this a security violation? What about flooding the internet with MACs by creating multiple VM instances on many machines?
3) If the MAC is chosen randomly, it could possibly match the MAC of some other network device; how is this addressed?
Thank You,
Gopinath.
1) Regular NICs are fairly simple devices; it is the responsibility of the OS (and the respective kernel-space or user-space drivers) to provide a valid Ethernet frame, using (if desired) the vendor-assigned MAC address stored (for reference) in the network card's memory. Whether the frame comes from the host OS or from a guest OS (through the virtual switch in the hypervisor) is irrelevant. The situation becomes slightly different in the case of NFV and smart NICs, but not by much.
The whole point of virtualization is that you shouldn't be able to tell the difference between your OS running on a virtual server and running on a standalone machine next to your host (whether you look from inside the system or from outside).
2) No, security doesn't get worse because of this. As mentioned in the previous point, the situation would be similar if you put one physical host next to another. And from a security point of view, it's easier to flood a local network with packets carrying a forged source MAC than to instantiate the same number of VMs.
3) Collisions affect the local network the same way as with regular hosts. The possibility is always there, but the probability is extremely low: a randomly generated locally administered unicast MAC has 46 freely choosable bits (two bits of the first octet are fixed). I've seen only one such collision, and only because the MACs were set manually, not picked randomly.
Is there a way to exclusively reserve the GPU for an OpenCL host program?
No other process should have access to this device via OpenCL or OpenGL.
The background is that my software calculates real-time data on the GPU, so performance suffers if the GPU is doing other work as well.
There have been many questions about what determines the size of a pointer.
Basically, as a rule of thumb, you can say it is the processor architecture:
x86 -> 4-byte pointer
x64 -> 8-byte pointer
I have also seen some people here say the system bus is responsible for it, while others denied that. Let's say the architecture tells me the size of a pointer.
To address 4 GB of RAM you need 4,294,967,296 addresses, and a 4-byte pointer can address exactly 4,294,967,296 memory locations.
To address 8 GB of RAM you need 8,589,934,592 addresses, and a 4-byte pointer cannot represent all of those values.
So is this why I cannot have more than 4 GB of RAM on the x86 architecture?
The amount of RAM is not strictly limited by the architecture (32- or 64-bit). The architecture only decides how much memory can be addressed at a time by the OS and the programs running on it. On a 32-bit machine, the OS and programs can "see" only 4 GB of memory at once through a 4-byte pointer, but that doesn't mean the machine can hold only 4 GB of RAM. If the manufacturer provides for it, you can have, say, 16 GB as 4 x 4 GB banks: two extra "hidden" address lines, plus logic to drive their levels (00, 01, 10, 11), select which of the 4 GB banks is in view. These hidden address bits are not visible to the software layers, which still use 4-byte pointers; the number of hidden lines decides by how much the RAM can be extended. This is essentially what x86 PAE (Physical Address Extension) does: 36 physical address bits allow up to 64 GB of physical RAM, even though each process still sees a 32-bit virtual address space.
This is just one example; it depends on the vendor how they decide to provide for the extra RAM.
In the AMD APP programming guide it is written (p. 4-15):
For transfers <=32 kB: For transfers from the host to device, the data is copied by the CPU
to a runtime pinned host memory buffer, and the DMA engine transfers the
data to device memory. The opposite is done for transfers from the device to
the host.
Is the DMA mentioned above the CPU's DMA engine or the GPU's DMA engine?
I believe it is the GPU's DMA engine, since on some cards (e.g., NVIDIA) you can do simultaneous reads and writes (so this is a GPU capability, not a CPU capability).