Pointer size with respect to RAM and architecture

There have been many questions about what determines the size of a pointer.
Basically, as a rule of thumb, you can say it is the processor architecture:
x86 -> 4-byte pointers
x64 -> 8-byte pointers
I have also seen some people here say it is the system bus that is responsible for it, while others denied that. Let's say the architecture tells me the size of a pointer.
To address 4 GB of RAM you need 4,294,967,296 distinct addresses, and a pointer of size 4 bytes can address exactly 4,294,967,296 memory locations.
To address 8 GB of RAM you need 8,589,934,592 distinct addresses, and a pointer of size 4 bytes cannot represent all of them.
So is this why I cannot have more than 4 GB of RAM on the x86 architecture?

The amount of RAM is not limited by the architecture (32-bit or 64-bit) alone. The architecture only decides how much memory can be addressed at a time by the OS and the programs running on it. On a 32-bit machine, that is, a machine with a 32-bit-wide address bus, the OS and the programs can "see" only 4 GB of memory. But that doesn't mean the machine can hold only 4 GB of RAM.

If the manufacturer has provided for it, you can have 16 GB, i.e. 4x4 GB, of RAM. In that case there will be 2 extra "hidden" address lines, plus hardwired logic to decide the levels of those 2 lines, thus selecting any of the four available 4 GB banks: 00, 01, 10, 11. These "hidden" address bits are not visible to the software layers, so those layers still use only a 4-byte pointer. The number of these "hidden" address lines decides by how much you can extend your RAM.

This is just one example; it depends on the vendor how they decide to provide for the extra RAM.

Related

How does Intel DSA (Data Streaming Accelerator) store descriptors in local SRAM?

I would like to ask a question regarding Intel DSA (Data Streaming Accelerator).
According to the documentation, Intel DSA can support up to 256 queues and 65,536 entries (the bit widths of the corresponding registers are 8 and 16, respectively). Are all descriptors stored locally in SRAM, or does the hardware write them out to DDR?
I would think such a large number of descriptors and queues would take up a lot of chip area, and that the hardware circuit timing and congestion would be difficult to control.

Cheat Engine - how to trace a pointer from a 6-byte address?

I am trying to learn basic programming with Cheat Engine and games.
So far, I still can't grasp pointers, particularly how to trace them.
Most of the tutorials on pointers work with 4-byte addresses, but what I have is a 6-byte address, and so far I have failed to track down the base address from it.
As shown in the screenshot, R9 is the offset and RCX should lead back to the pointer. R9 stays the same, while RCX changes each time the game restarts. Where should I go from here?
A 32-bit address space uses 32 bits (4 bytes) for memory addressing, while a 64-bit address space uses 64 bits (8 bytes).
In practice, a full 64-bit address space is much, much more than required (larger than the estimated storage size of the entire internet), so current systems use only 48 bits (6 bytes) to address their memory.
Since most programming languages and CPUs in general only support 32-bit and 64-bit integer types (not 48-bit), the 48-bit address is stored in a 64-bit variable/register, with the most significant bytes being zero (0x0000).
Therefore, in order to scan for the pointer value, you have to scan for an 8-byte value (with "Hex" ticked, as CE shows address values as hex by default).

SNMP Network traffic difference between 32-bit and 64-bit Counter

I am trying to monitor some Cisco 2960-X switches, and I know there are the SNMP OIDs ifInOctets (32-bit counter) and ifHCInOctets (64-bit counter). Can someone explain the difference between these counters and which one I should use for a GigabitEthernet interface?
Thanks
64-bit counters are used for high capacity interfaces where 32-bit counters do not provide enough capacity and wrap too fast.
As the speed of network media increases, the minimum time in which a 32-bit counter wraps decreases.
For example, a 10 Mbps stream of back-to-back, full-size packets causes ifInOctets to wrap in just over 57 minutes. At 100 Mbps, the minimum wrap time is 5.7 minutes, and at 1 Gbps, the minimum is 34 seconds.
For interfaces that operate at 20,000,000 (20 million) bits per second or less, you must use 32-bit byte and packet counters. For interfaces that operate faster than 20 million bits per second, and slower than 650,000,000 bits per second, you must use 32-bit packet counters and 64-bit octet counters. For interfaces that operate at 650,000,000 bits/second or faster, 64-bit packet and octet counters must be used.
A detailed explanation of these counter-size requirements can be found in the IF-MIB specification (RFC 2863).

How memory is mapped to gpu (opencl Intel graphics)

I'm using an Intel integrated GPU for my OpenCL implementation. I am implementing a program with zero copy, where I'm not copying the data to the GPU; instead it shares the common memory (RAM).
I have a 64-bit CPU, but the GPU specs show it has only a 32-bit addressing mode.
I'm sharing a malloc'd heap space between the GPU and CPU, and when I print the addresses I see the following.
In the GPU kernel:
if (id == 0) {
    printf("Mem address: %p\n", A);
    // Outputs: Mem address: 0x1010000
}
On the CPU it prints:
printf("Outside Mem address: %p\n", cpuA);
// Device: Intel(R) HD Graphics IvyBridge M GT2
// Outside Mem address: 0x7fcd529d9000
I am not getting how it is mapped in the GPU. And I wonder: is 2^28/2^32 the maximum address the GPU could access?
The memory address you are printing on the host is a virtual address that only makes sense in the context of your program's process. In the CPU, this is transparently translated to a physical RAM page, the address of which is unrelated to the virtual address but stored in a lookup table (page table) maintained by the operating system. Note that "64-bit CPU" typically refers to the number of bits in a virtual address. (Although many 64-bit CPUs actually ignore 8-16 bits.) The number of bits for physical addresses (for addressing physical RAM cells and mapped device memory) is often a lot less, as little as 40 bits.
Devices attached to the system and able to perform direct memory accesses (DMA) most commonly deal with physical memory addresses. If your Intel GPU does not have an internal memory mapping scheme (and there is no IOMMU active, see below) then the address you are seeing in your OpenCL kernel code is probably a physical memory address. If the device can only address 32 bits, this means it can only access the first 4GiB of physical memory in your system. By assigning memory above 4GiB to devices and user space processes that aren't affected by a 32-bit restriction, or by using "bounce buffers", the operating system can arrange for any buffers used by the restricted device to be located in that memory area, regardless of virtual address.
Recently, IOMMUs have become common. These introduce a virtual memory like mapping system for devices as well - so the memory addresses a device sees are again unrelated to the physical addresses of system memory they correspond to. This is primarily a security feature - ideally, each device gets its own address space, so devices can't accidentally or deliberately access system memory they should not be accessing. It also means that the 32-bit limitation becomes completely irrelevant, as each device gets its own 32-bit address space which can be mapped to physical memory beyond the 4GiB boundary.

PCIe Bandwidth on ATI FirePro

I am trying to measure PCIe bandwidth on an ATI FirePro 8750. The AMD APP sample PCIeBandwidth in the SDK measures the bandwidth of transfers in both directions:
Host to device, using clEnqueueWriteBuffer().
Device to host, using clEnqueueReadBuffer().
On my system (Windows 7, Intel Core2Duo, 32-bit) the output looks like this:
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : ATI RV770
Host to device : 0.412435 GB/s
Device to host : 0.792844 GB/s
This particular card has 2 GB DRAM and a max clock frequency of 750 MHz.
1- Why is the bandwidth different in each direction?
2- Why is the bandwidth so small?
Also, I understand that this communication takes place through DMA, so the bandwidth may not be affected by the CPU.
This paper from Microsoft Research gives some inkling of why there is asymmetric PCIe data transfer bandwidth between GPU and CPU. The paper describes performance metrics for FPGA-GPU data transfer bandwidth over PCIe. It also includes metrics for CPU-GPU data transfer bandwidth over PCIe.
To quote the relevant section:
'it should also be noted that the GPU-CPU transfers themselves also show some degree of asymmetric behavior. In the case of a GPU to CPU transfer, where the GPU is initiating bus master writes, the GPU reaches a maximum of 6.18 GByte/Sec. In the opposite direction from CPU to GPU, the GPU is initiating bus master reads and the resulting bandwidth falls to 5.61 GByte/Sec. In our observations it is typically the case that bus master writes are more efficient than bus master reads for any PCIe implementation due to protocol overhead and the relative complexity of implementation. While a possible solution to this asymmetry would be to handle the CPU to GPU direction by using CPU initiated bus master writes, that hardware facility is not available in the PC architecture in general.'
The answer to the second question, on why the bandwidth is so small, likely comes down to the transfer size.
See figs. 2, 3, 4 and 5 in the paper. I have also seen graphs like this at the 1st AMD Fusion Conference. The explanation is that PCIe data transfers have overheads due to the protocol and the device latency. These overheads are more significant for small transfer sizes and become less significant for larger ones.
What levers do you have to control or improve performance?
Getting the right combo of chip/motherboard and GPU is the hardware lever. Chips with the maximum number of PCIe lanes are better, as is a higher-spec PCIe protocol: PCIe 3.0 is better than PCIe 2.0, but all components need to support the higher standard.
As a programmer, controlling the data transfer size is a very important lever.
Transfer sizes of 128 KB - 256 KB get approximately 50% of the max bandwidth; transfers of 1 MB - 2 MB get over 90% of the max bandwidth.
