Intel cache Address

Here is the L3 cache (shared) configuration on my Intel Xeon Silver 4210R CPU:
$ getconf -a | grep LEVEL3_CACHE
LEVEL3_CACHE_SIZE 14417920
LEVEL3_CACHE_ASSOC 11
LEVEL3_CACHE_LINESIZE 64
This configuration implies that the number of sets in the cache is 14417920 / (11 × 64) = 20480.
Now I am trying to understand the addressing of the cache.
Here, the cache line (or block) size is 64 bytes and Intel uses a byte-addressable system. Therefore, the least significant log2(64) = 6 bits of the address should be used for the block offset.
With a similar calculation, the number of address bits that should be used for set indexing is log2(20480) ≈ 14.32, but this fractional value confuses me.
Am I missing something? How many bits are exactly used here for set indexing?
Edit: Eric mentioned in his answer below that each of the 10 processor cores has a 1.375 MiB slice of the L3 cache. But such a configuration raises another question in my mind. Let's assume that I am running two processes on core-0 and core-1. If both processes use virtual address 0x0, will those virtual addresses be mapped to the same core's L3 cache (assuming a VIPT cache)? In other words, as the L3 cache is shared, which part of the virtual address distinguishes the core-0 L3 cache from the core-1 L3 cache?

Am I missing something?
This processor has 10 cores; your formula doesn't account for the number of cores, so if you divide by 10 you get an exact power of 2.
How many bits are exactly used here for set indexing?
11 bits, I believe
L3$: 13.75 MiB (10 × 1.375 MiB), 11-way set associative, write-back
read more here: https://en.wikichip.org/wiki/intel/xeon_silver/4210r
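To make the arithmetic concrete, here is a small plain-C sketch that reproduces the numbers from the getconf output above and the per-slice division described in this answer: a 64-byte line gives 6 offset bits, 14417920 / (11 × 64) = 20480 sets in total, and 20480 / 10 slices = 2048 sets per slice, i.e. 11 index bits per slice.

#include <stdio.h>

int main(void) {
    /* Values reported by getconf for the Xeon Silver 4210R's shared L3. */
    long cache_size = 14417920;  /* LEVEL3_CACHE_SIZE  (bytes)        */
    long assoc      = 11;        /* LEVEL3_CACHE_ASSOC (ways per set) */
    long line_size  = 64;        /* LEVEL3_CACHE_LINESIZE (bytes)     */
    long slices     = 10;        /* one L3 slice per core (wikichip)  */

    long total_sets     = cache_size / (assoc * line_size);  /* 20480 */
    long sets_per_slice = total_sets / slices;                /* 2048 */

    /* Count offset/index bits by shifting; both are exact powers of two. */
    int offset_bits = 0, index_bits = 0;
    for (long v = line_size;      v > 1; v >>= 1) offset_bits++;  /* 6  */
    for (long v = sets_per_slice; v > 1; v >>= 1) index_bits++;   /* 11 */

    printf("total sets       : %ld\n", total_sets);
    printf("sets per slice   : %ld\n", sets_per_slice);
    printf("offset bits      : %d\n",  offset_bits);
    printf("index bits/slice : %d\n",  index_bits);
    return 0;
}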

Related

What defines a CPU's address space?

My confusion is based on these 3 thoughts:
Is it 2^(number of address pins available on the CPU)?
Is it 2^(width of one specific register)?
Is it a hardware circuit that understands all the addresses within a range of addresses? Then what is it?
I'm not asking about the virtual address space here; I don't know what it's called, maybe it's the physical address space of all physical devices, including RAM.
Besides, even if I get a correct answer, I would also like to ask: why does my CPU have 2^39 bytes (512 GB) of memory address space and a 64KB+3 I/O space? This information is written in the Intel documentation for my system in package (Intel Core i3-4005U with an integrated Lynx Point-M PCH).
You are welcome to edit my question if I'm asking it wrong. Thank you.
The size of the physical address space for a CPU is an arbitrary choice made by the designers. Cache tags and TLB entries have to be wide enough to cover it, because caches are physically tagged (including L1d in most CPUs, and in all Intel CPUs), and so do other internal structures that deal with physical addresses (like the store buffer, for matching load addresses against outstanding stores, and also for matching stores against in-flight code addresses).
All Haswell-client CPUs share the same core microarchitecture, so even though a laptop chip doesn't need that much physical address space, some single-socket non-Xeon desktops might. (I think this is true; saving a small amount of space and/or power by shrinking cache tag widths by 1 or 2 bits might be plausible, but IDK if Intel does that; I think they really only want to validate a design once.)
Remember that device memory (including memory on PCIe cards such as a VGA card or a Xeon Phi compute card) will normally be mapped into the physical address space so the CPU can access it with loads/stores (after pointing virtual pages at those regions of physical address space). PCIe uses fixed-width links and sends addresses as part of message "packets"; no extra pins are required for more address bits.
The DDR3 DRAM controllers have a number of address lines on each channel to send row/column addresses; it might be possible to leave one pin unused. This is very similar to other DDR versions; Wikipedia has a diagram of some of the signals in the DDR4 article: https://en.wikipedia.org/wiki/DDR4_SDRAM#Command_encoding
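If you want to see the limit your own CPU reports, the extended CPUID leaf 0x80000008 returns the supported physical and virtual address widths (the same numbers Linux prints as "address sizes" in /proc/cpuinfo). A minimal sketch using GCC/Clang's <cpuid.h>; on the Haswell i3-4005U in the question this should report 39 physical and 48 virtual bits:

#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Leaf 0x80000008: EAX[7:0]  = physical address bits,
     *                  EAX[15:8] = linear (virtual) address bits. */
    if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 0x80000008 not supported\n");
        return 1;
    }
    printf("physical address bits: %u\n", eax & 0xff);
    printf("virtual  address bits: %u\n", (eax >> 8) & 0xff);
    return 0;
}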

Does NUMA impact memory bandwidth, or just latency?

I have a problem that is memory bandwidth limited -- I need to read a lot (many GB) of data sequentially from RAM, do some quick processing and write it sequentially to a different location in RAM. Memory latency is not a concern.
Is there any benefit from dividing the work between two or more cores in different NUMA zones? Equivalently, does working across zones reduce the available bandwidth?
For bandwidth-limited, multi-threaded code, the behavior in a NUMA system will primarily depend on how "local" each thread's data accesses are, and secondarily on details of the remote accesses.
In a typical 2-socket server system, the local memory bandwidth available to two NUMA nodes is twice that available to a single node. (But remember that it may take many threads running on many cores to reach asymptotic bandwidth for each socket.)
The STREAM benchmark, for example, is typically run in a configuration that allows almost all accesses from every thread to be "local". This is achieved by relying on "first touch" NUMA placement: when allocated memory is first written, the OS has to create mappings from the process virtual address space to physical addresses, and (by default) the OS chooses physical addresses in the same NUMA node as the core that executed the store instruction.
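As an illustration of first-touch placement (a sketch in C with OpenMP, not part of STREAM itself; assumes a static schedule and threads pinned to cores, e.g. with OMP_PROC_BIND=close, and compilation with -fopenmp): if the same threads that will later process the data also perform the first write, the backing pages end up on each thread's local NUMA node.

#include <stdlib.h>

int main(void) {
    size_t n = 1UL << 28;                 /* 2^28 doubles, about 2 GiB */
    double *a = malloc(n * sizeof *a);

    /* First touch: each thread writes its own chunk, so (with the default
     * first-touch policy) the pages are allocated on the NUMA node of the
     * core that thread runs on. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;

    /* Later processing should use the same schedule and pinning, so each
     * thread mostly touches pages that are local to it. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}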
"Local" bandwidth (to DRAM) in most systems is approximately symmetric (for reads and writes) and relatively easy to understand. "Remote" bandwidth is much more asymmetric for reads and writes, and there is usually significant contention between the read/write commands going between the chips and the data moving between the chips. The overall ratio of local to remote bandwidth also varies significantly across processor generations. For some processors (e.g., Xeon E5 v3 and probably v4), the interconnect is relatively fast, so jobs with poor locality can often be run with all of the memory interleaved between the two sockets.
Local bandwidths have increased significantly since then, with more recent processors generally strongly favoring local access.
Example from the Intel Xeon Platinum 8160 (2 UPI links between chips):
Local Bandwidth for Reads (each socket) ~112 GB/s
Remote Bandwidth for Reads (one-direction at a time) ~34 GB/s
Local bandwidth scales perfectly in two-socket systems, and remote bandwidth also scales very well when using both sockets (each socket reading data from the other socket).
It gets more complicated with combined read and write traffic between sockets, because the read traffic from node 0 to node 1 competes with the write traffic from node 1 to node 0, etc.
Local Bandwidth for 1R:1W (each socket) ~101 GB/s (reduced due to read/write scheduling overhead)
Remote Bandwidth for 1R:1W (one socket running at a time) ~50 GB/s -- more bandwidth is available because both directions are being used, but this also means that if both sockets are doing the same thing, there will be conflicts. I see less than 60 GB/s aggregate when both sockets are running 1R:1W remote at the same time.
Of course different ratios of local to remote accesses will change the scaling. Timing can also be an issue -- if the threads are doing local accesses at the same time, then remote accesses at the same time, there will be more contention in the remote access portion (compared to a case in which the threads are doing their remote accesses at different times).
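A rough way to see the local-vs-remote difference on a 2-socket Linux box is to pin a thread to one node and time streaming reads from buffers placed on each node with libnuma (link with -lnuma). This is a crude sketch, not a calibrated benchmark: a single thread will not reach the full socket bandwidth, but the local/remote ratio is usually still visible.

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

static double time_read(const char *buf, size_t len) {
    struct timespec t0, t1;
    volatile long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < len; i += 64)   /* touch one byte per cache line */
        sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA system with at least 2 nodes\n");
        return 1;
    }
    size_t len = 1UL << 30;                      /* 1 GiB per buffer */
    numa_run_on_node(0);                         /* keep this thread on node 0 */

    char *local  = numa_alloc_onnode(len, 0);    /* memory on node 0 (local)  */
    char *remote = numa_alloc_onnode(len, 1);    /* memory on node 1 (remote) */
    memset(local, 1, len);                       /* fault the pages in */
    memset(remote, 1, len);

    printf("local : %.2f GB/s\n", len / time_read(local, len) / 1e9);
    printf("remote: %.2f GB/s\n", len / time_read(remote, len) / 1e9);

    numa_free(local, len);
    numa_free(remote, len);
    return 0;
}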

Why does clCreateBuffer with CL_MEM_ALLOC_HOST_PTR use discrete device memory?

I have a piece of code in which I use clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag, and I realised that this allocates memory on the device. Is that correct, or am I missing something from the standard?
CL_MEM_ALLOC_HOST_PTR: This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.
Personally, I understood that this buffer should be a host-side buffer that can later be mapped using clEnqueueMapBuffer.
Here is some info about the device I'm using:
Device: Tesla K40c
Hardware version: OpenCL 1.2 CUDA
Software version: 352.63
OpenCL C version: OpenCL C 1.2
It is described in https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html as:
OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
That description is for CL_MEM_USE_HOST_PTR, but it differs from CL_MEM_ALLOC_HOST_PTR only in the allocator: USE takes a host-given pointer, while ALLOC uses a pointer returned by the OpenCL implementation's own allocator.
The caching is not applicable to some integrated-GPU types, so it's not always the case.
The key phrase from the spec is host accessible:
This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.
It doesn't say it'll be allocated in host memory: it says it'll be accessible by the host.
This includes any memory that can be mapped into CPU-visible memory addresses. Typically some, if not all VRAM in a discrete graphics device will be available through a PCI memory range exposed in one of the BARs - these get mapped into the CPU's physical memory address space by firmware or the OS. They can be used similarly to system memory in page tables and thus made available to user processes by mapping them to virtual memory addresses.
The spec even goes on to mention this possibility, at least in combination with another flag:
CL_MEM_COPY_HOST_PTR can be used with CL_MEM_ALLOC_HOST_PTR to initialize the contents of the cl_mem object allocated using host-accessible (e.g. PCIe) memory.
If you definitely want to use system memory for a buffer (which may be a good choice if GPU access to it is sparse or less frequent than CPU access), allocate it yourself and wrap it in a buffer with CL_MEM_USE_HOST_PTR. (Which may still end up being cached in VRAM, depending on the implementation.)
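For completeness, a minimal host-side sketch of the pattern discussed above: allocate with CL_MEM_ALLOC_HOST_PTR, map it with clEnqueueMapBuffer to fill it from the CPU, then unmap before kernels use it. Error handling is mostly omitted and it simply picks the first GPU it finds; where the buffer actually lives is up to the implementation.

#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void) {
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    /* The flag only promises the buffer will be host-mappable; the
     * implementation chooses where it is physically allocated. */
    size_t size = 1 << 20;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);

    /* Map for writing, fill from the CPU, then unmap so kernels can use it. */
    void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, size, 0, NULL, NULL, &err);
    memset(p, 0, size);
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}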

Setting of Intel EPT accessed and dirty flags for guest page tables

I am reading the Intel virtualization manual, which says that if bit 6 of the EPTP (a VM-execution control field) is set, the processor will set the accessed and dirty bits in the relevant EPT entries according to certain rules.
I am trying to understand how the guest operating system benefits if the processor sets the A/D bits in the EPT on access and modification of the relevant pages, given that the guest OS has no access to the EPT. In my understanding, A/D bits are used by the OS's memory manager for optimization and swapping algorithms, and these bits play no role in the page walk itself.
Do I (as the programmer of the VMM) have to add code in the VMM to search for the relevant entry in the GPA space and mark the bits accordingly?
If this is the case, then how can we say that these bits are set without the knowledge of the VMM?
An explanation of how KVM deals with this would also be a good answer.
In general, the guest OS does not benefit from the accessed and dirty bits in the EPT being set. As you stated, the guest does not typically have access to the EPT; this is purely for the hypervisor/VMM. It is analogous to the dirty and accessed bits in a process's page table: the process does not use them, only the OS does.
With regard to your second question, it is a bit unclear, so I'm not sure what you are asking. However, the hardware will mark the accessed and dirty bits, assuming it has been set up correctly; you do not have to do it manually.
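For reference, from the VMM side enabling this feature is just a matter of setting bit 6 when the EPTP field is written. Below is a sketch of how an EPTP value might be assembled; the field layout is taken from the SDM's EPTP description, so double-check it against the current manual before relying on it.

#include <stdint.h>

/* Build an EPTP value for a 4-level EPT whose PML4 table sits at
 * pml4_phys (a 4 KiB-aligned physical address). */
static uint64_t make_eptp(uint64_t pml4_phys, int enable_ad_bits) {
    uint64_t eptp = 0;
    eptp |= 6;                     /* bits 2:0   - memory type: 6 = write-back */
    eptp |= (4 - 1) << 3;          /* bits 5:3   - EPT page-walk length minus 1 */
    if (enable_ad_bits)
        eptp |= 1ULL << 6;         /* bit 6      - enable EPT accessed/dirty flags */
    eptp |= pml4_phys & ~0xfffULL; /* bits 51:12 - physical address of PML4 table */
    return eptp;
}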

Determine limiting factor of OpenCL workgroup size?

I am trying to run some OpenCL kernels written for desktop graphics cards on an embedded GPU with less resources. In particular, the desktop version assumes a work group size of at least 256 is always supported, but the Mali T628 ARM-based GPU only guarantees 64+ work group size.
Indeed, some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64, and I can't figure out why. I checked the CL_KERNEL_LOCAL_MEM_SIZE for the kernels in question and it is <2 KiB, whereas the CL_DEVICE_LOCAL_MEM_SIZE is 32 KiB, so I think I can rule out __local storage.
What other factors (eg, registers/__private memory?) contribute to a low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage? I am open to both programmatic introspection (such as clGetKernelWorkGroupInfo(), some of which I have already done) and any development tools I may not know about.
EDIT:
The kernels are part of the OpenCL module of OpenCV v2.4. In particular, the kernel icvCalcOrientation in surf.cl. The code is fairly complex, and there are several compile-time parameters set, which is why it is a bit infeasible to manually analyze the kernel for the issue without some hint of what to look at.
If there is a way to troubleshoot this on NVidia or AMD hardware (which I have access to), I am open to it.
EDIT
Since my previous answer was plainly wrong, I need more info on the problem.
By saying "some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64" you're implying that kernels exist where a larger work-group size is available. Is that the case? If not then the answer unfortunatlely is that the device is simply not capable of supporting more than 64 work-items.
Could you please query all available info for the device and the kernel after setting all kernel arguments and before executing the kernel? The parameters to query (mostly taken from (Source); see the sketch after this list for one way to do it) are:
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
CL_KERNEL_LOCAL_MEM_SIZE
CL_KERNEL_PRIVATE_MEM_SIZE
There might be more, but currently none come to mind.
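A sketch of the per-kernel queries (it assumes you already have the cl_kernel and the cl_device_id you are compiling for; the device-wide limits come from clGetDeviceInfo in the same way):

#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

/* Print the kernel-specific limits that bound the usable work-group size. */
static void print_kernel_limits(cl_kernel kernel, cl_device_id device) {
    size_t wg_size = 0, dev_max_wg = 0;
    cl_ulong local_mem = 0, private_mem = 0, dev_local_mem = 0;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof wg_size, &wg_size, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof local_mem, &local_mem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof private_mem, &private_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof dev_max_wg, &dev_max_wg, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof dev_local_mem, &dev_local_mem, NULL);

    printf("CL_KERNEL_WORK_GROUP_SIZE  : %zu (device max %zu)\n",
           wg_size, dev_max_wg);
    printf("CL_KERNEL_LOCAL_MEM_SIZE   : %llu bytes (device has %llu)\n",
           (unsigned long long)local_mem, (unsigned long long)dev_local_mem);
    printf("CL_KERNEL_PRIVATE_MEM_SIZE : %llu bytes per work-item\n",
           (unsigned long long)private_mem);
}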
General information:
A workgroup size can be limited because the local memory is limited. And this limit can be reached if you have a kernel that uses lots of private memory (“lots” is a relative term – on weaker hardware this may be reached even with seemingly few variables). "However this limit is just under ideal conditions. If your kernel uses high amount of WI per WG maybe some of the private WI data is being spilled out to local memory. [...]" (Source).
So some of this private memory may be spilled to local memory without you realizing it, so that the local memory you use explicitly plus the local memory needed for the spilled private data exceeds the available local memory size.
CL_DEVICE_LOCAL_MEM_SIZE returns the available size of local memory, and CL_KERNEL_LOCAL_MEM_SIZE tells you how much local memory the kernel uses. Apparently this also takes dynamic local memory into consideration by looking at clSetKernelArg; however, I am unsure how this is supposed to work if you query CL_KERNEL_LOCAL_MEM_SIZE before setting the kernel argument (which is what you would want to do in order to determine the size of local memory...)
Anyway, OpenCL knows exactly how much local memory you use, so it can calculate how many work-items (each of which has private memory that may need spilling to local memory) it can support. This reduced work-group size may be what you get when querying CL_KERNEL_WORK_GROUP_SIZE.
After looking at the kernel you posted I don't think that local memory is the problem here (which is what you already suspected), especially since you only use about 2 KiB of the 32 KiB of local memory.
What other factors (eg, registers/__private memory?) contribute to low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage?
On Mali all memory used by compute workloads is global (i.e. backed by system RAM), so memory pressure shouldn't cause any problems except through secondary effects (such as cache thrashing). I suspect register allocation constraints may come into play here - larger workgroups mean more concurrent threads active in the shader core, which means higher pressure on the register file - although I don't know for sure.
The Mali offline compiler for OpenGL ES reports work register usage - for example it can report this type of information:
./malisc -c Mali-T760 -r r1p0 -d Mali-T600_r5p0-00rel0 --fragment -V test.frag
ARM Mali Offline Compiler v4.5.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.
1 work registers used, 0 uniform registers used, spilling not used.
                A   L/S   T   Total   Bound
Cycles:         2   0     0   2       A
Shortest Path:  1   0     0   1       A
Longest Path:   1   0     0   1       A
Note: The cycles counts do not include possible stalls due to cache misses.
I'm not sure if ARM have an offline compiler for OpenCL which can report similar information - it might be worth asking over on the ARM Connected Community site.

Resources