find maximum allowed ibv_reg_mr - mpi

I'm trying to diagnose a memory allocation error thrown by ibv_reg_mr() in software that I use. My suspicion is that it's related to known problems with some Mellanox InfiniBand cards, where the default maximum amount of memory that can be registered is about 2 GB (see FAQ #18 here: http://www.open-mpi.org/faq/?category=openfabrics ).
I would like to confirm unequivocally whether this is the case so I can quickly negotiate a solution with my system administrators. Being unfamiliar with RDMA and InfiniBand, could someone suggest either (a) a simple program that registers arbitrary amounts of memory, so that I can trigger the error at the maximum allowed value, or (b) a way to determine how InfiniBand is currently configured, considering that I do not have root access?
Thanks everyone!
Jason

You can read the parameters for the Mellanox InfiniBand HCA drivers from sysfs and you don't need root access to do so. The parameters for module <modname> are found in /sys/module/<modname>/parameters/. Each parameter is exposed as a text pseudofile there and its value can be read by simply reading the content of the file. You can even do that using standard Unix command line tools.
For the mlx4_core module the maximum amount of registrable memory is determined using the following formula:
max_reg = (1 << log_num_mtt) * (1 << log_mtts_per_seg) * PAGE_SIZE
For the ib_mthca module the formula is:
max_reg = (num_mtt - fmr_reserved_mtts) * (1 << log_mtts_per_seg) * PAGE_SIZE
where:
num_mtt is the maximum number of memory translation table (MTT) segments per HCA;
log_num_mtt is the binary logarithm of num_mtt;
fmr_reserved_mtts is the number of MTT segments reserved for fast memory registration (FMR);
log_mtts_per_seg is the binary logarithm of the number of MTT entries per segment.
PAGE_SIZE is the system page size, 4 KiB on most current platforms.
Each of these parameters (except PAGE_SIZE) can be read from its corresponding module directory in sysfs.
It is possible that both modules are loaded. In this case just do what Open MPI does: look for mlx4_core first and ib_mthca second.
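For convenience, the two formulas and the sysfs lookup can be wrapped in a short script. This is only a sketch: the sysfs paths follow the layout described above, and the helper names (read_param, mlx4_max_reg, mthca_max_reg) are mine, not part of any tool.

```python
import os

def read_param(module, name):
    # Module parameters are plain-text pseudofiles under sysfs;
    # reading them requires no root access.
    with open(f"/sys/module/{module}/parameters/{name}") as f:
        return int(f.read().strip())

def mlx4_max_reg(log_num_mtt, log_mtts_per_seg, page_size):
    # mlx4_core: max_reg = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size

def mthca_max_reg(num_mtt, fmr_reserved_mtts, log_mtts_per_seg, page_size):
    # ib_mthca: max_reg = (num_mtt - fmr_reserved_mtts) * 2^log_mtts_per_seg * PAGE_SIZE
    return (num_mtt - fmr_reserved_mtts) * (1 << log_mtts_per_seg) * page_size

def max_registrable_memory():
    # Do what Open MPI does: check mlx4_core first, then ib_mthca.
    page_size = os.sysconf("SC_PAGE_SIZE")
    if os.path.isdir("/sys/module/mlx4_core/parameters"):
        return mlx4_max_reg(read_param("mlx4_core", "log_num_mtt"),
                            read_param("mlx4_core", "log_mtts_per_seg"),
                            page_size)
    if os.path.isdir("/sys/module/ib_mthca/parameters"):
        return mthca_max_reg(read_param("ib_mthca", "num_mtt"),
                             read_param("ib_mthca", "fmr_reserved_mtts"),
                             read_param("ib_mthca", "log_mtts_per_seg"),
                             page_size)
    return None  # neither driver is loaded
```

As a plausibility check: with hypothetical values log_num_mtt=16, log_mtts_per_seg=3 and 4 KiB pages, mlx4_max_reg gives 2^16 * 2^3 * 4096 = 2^31 bytes, i.e. exactly the ~2 GB ceiling mentioned in the question.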

OpenvSwitch port missing in large load, long poll interval observed

ISSUE description
I have an OpenStack system with an HA management network (VIP) on an OVS (Open vSwitch) port. In this system, under high load (concurrent volume-from-Glance-image creation), the VIP port (an OVS port) goes missing.
Analysis
For now, at the default log level, the only thing observed in the log file is the unreasonably long 62741 ms poll interval shown below.
2017-12-29T16:40:38.611Z|00001|timeval(revalidator70)|WARN|Unreasonably long 62741ms poll interval (0ms user, 0ms system)
Idea for now
I will turn on debug-level logging for the log file and try to reproduce the issue:
sudo ovs-appctl vlog/set file:dbg
Question
What else should I do during/after reproducing the issue?
Is this issue typical? If so, what causes it?
I googled "Open vSwitch troubleshooting" and other related keywords, but the information was all at the data flow/table level rather than at this ovs-vswitchd level (am I right?).
Many thanks!
BR//Wey
This issue was not reproduced, so I forgot about it until recently; two years later, I ran into it again in a different environment, and this time I have more ideas about its root cause.
It could be caused by the hash shifting that happens during bond rebalancing: for some reason, the traffic pattern keeps meeting the condition that triggers shifts again and again (the condition is quite strict, I would say, but there is still a chance of hitting it, right?).
The condition for a shift is quoted below; please refer to the full doc here: https://docs.openvswitch.org/en/latest/topics/bonding/
Bond Packet Output
When a packet is sent out a bond port, the bond member actually used is selected based on the packet’s source MAC and VLAN tag (see bond_choose_output_member()). In particular, the source MAC and VLAN tag are hashed into one of 256 values, and that value is looked up in a hash table (the “bond hash”) kept in the bond_hash member of struct port. The hash table entry identifies a bond member. If no bond member has yet been chosen for that hash table entry, vswitchd chooses one arbitrarily.
Every 10 seconds, vswitchd rebalances the bond members (see bond_rebalance()). To rebalance, vswitchd examines the statistics for the number of bytes transmitted by each member over approximately the past minute, with data sent more recently weighted more heavily than data sent less recently. It considers each of the members in order from most-loaded to least-loaded. If highly loaded member H is significantly more heavily loaded than the least-loaded member L, and member H carries at least two hashes, then vswitchd shifts one of H’s hashes to L. However, vswitchd will only shift a hash from H to L if it will decrease the ratio of the load between H and L by at least 0.1.
Currently, “significantly more loaded” means that H must carry at least 1 Mbps more traffic, and that traffic must be at least 3% greater than L’s.
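To make the quoted mechanism concrete, here is a small Python sketch of the two pieces involved: hashing a (source MAC, VLAN) pair into one of 256 buckets, and the "significantly more loaded" test that gates a shift. The hash function and names are mine for illustration only; vswitchd's actual hash is different (see bond_choose_output_member() and bond_rebalance() in the OVS source).

```python
import hashlib

def bond_hash(src_mac, vlan):
    # Hash source MAC + VLAN tag into one of 256 bond-hash buckets.
    # (Illustrative stand-in for the real bond hash in vswitchd.)
    key = f"{src_mac.lower()}/{vlan}".encode()
    return hashlib.sha256(key).digest()[0]  # 0..255

def significantly_more_loaded(load_h_bps, load_l_bps):
    # The quoted condition: member H must carry at least 1 Mbps more
    # traffic than L, and that traffic must be at least 3% greater than L's.
    return (load_h_bps - load_l_bps >= 1_000_000 and
            load_h_bps >= 1.03 * load_l_bps)
```

Every flow with the same (source MAC, VLAN) always lands in the same bucket, so a handful of heavy flows can keep two members unbalanced enough to satisfy this test at every 10-second rebalance, which is the repeated-shift pattern suspected above.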

_mm512_storenr_pd and _mm512_storenrngo_pd

What is the difference between _mm512_storenrngo_pd and _mm512_storenr_pd?
_mm512_storenr_pd(void * mt, __m512d v):
Stores packed double-precision (64-bit) floating-point elements from v
to memory address mt with a no-read hint to the processor.
It is not clear to me what a no-read hint means. Does it mean that it is a non-cache-coherent write? Does it mean that reuse is more expensive, or not coherent?
_mm512_storenrngo_pd(void * mt, __m512d v):
Stores packed double-precision (64-bit) floating-point elements from v
to memory address mt with a no-read hint and using a weakly-ordered
memory consistency model (stores performed with this function are not
globally ordered, and subsequent stores from the same thread can be
observed before them).
Basically the same as storenr_pd, but since it uses a weak consistency model, a thread's subsequent stores can be observed before these ones. But is access by another processor non-coherent, or more expensive?
Quote from Intel® Xeon Phi™ Coprocessor Vector Microarchitecture:
In general, in order to write to a cache line, the Xeon Phi™ coprocessor needs to read in a cache line before writing to it. This is known as read for ownership (RFO). One problem with this implementation is that the written data is not reused; we unnecessarily take up the BW for reading non-temporal data. The Intel® Xeon Phi™ coprocessor supports instructions that do not read in data if the data is a streaming store. These instructions, VMOVNRAP*, VMOVNRNGOAP* allow one to indicate that the data needs to be written without reading the data first. In the Xeon Phi ISA the VMOVNRAPS/VMOVNRPD instructions are able to optimize the memory BW in case of a cache miss by not going through the unnecessary read step.
The VMOVNRNGOAP* instructions are useful when the programmer tolerates weak write-ordering of the application data―that is, the stores performed by these instructions are not globally ordered. This means that the subsequent write by the same thread can be observed before the VMOVNRNGOAP instructions are executed. A memory-fencing operation should be used in conjunction with this operation if multiple threads are reading and writing to the same location.
It seems that "No-read hints", "Streaming store", and "Non-temporal Stream/Store" are used interchangeably in several resources.
So yes, it is a non-cache-coherent write, though on Knights Corner (KNC, where both vmovnrap* and vmovnrngoap* belong) the stores go to the L2 cache; they do not bypass all levels of cache.
As explained in the quote above, vmovnrngoap* differs from vmovnrap* in that its weakly-ordered memory consistency model allows a subsequent write by the same thread to be observed before the VMOVNRNGOAP instructions are executed. So yes, the access of another thread or processor is non-coherent, and a fencing operation should be used. Though CPUID can be used as the fencing operation, better options are LOCK ADD [RSP],0 (a dummy atomic add) or XCHG (which combines a store and a fence).
A few more details:
On KNC, if you use the compiler switch -opt-streaming-stores always or the pragma #pragma vector nontemporal, the generated code defaults to VMOVNRNGOAP* starting with Composer XE 2013 Update 1.
More quotes from COMPILER-BASED MEMORY OPTIMIZATIONS FOR HIGH PERFORMANCE COMPUTING SYSTEMS
NR Stores. The NR store instruction (vmovnr) is a standard vector store instruction that can always be used safely. An NR store instruction that misses in the local cache causes all potential copies of the cache line in remote caches to be invalidated, the cache line to be allocated (but not initialized) at the local cache in exclusive state, and the write-data in the instruction to be written to the cache line. There is no data transfer from main memory, which is what saves memory bandwidth. An NR store instruction and other load and/or store instructions from the same thread are globally ordered, which means that all observers of this sequence of instructions always see the same fixed execution order.
The NR.NGO (non-globally ordered) store instruction (vmovnrngo) relaxes the global ordering constraint of the NR store instruction. This relaxation makes the NR.NGO instruction have a lower latency than the NR instruction, which can be used to achieve higher performance in streaming-store-intensive applications. However, removing this restriction means that an NR.NGO store instruction and other load and/or store instructions from the same thread can be observed by two observers to have two different orderings. The use of NR.NGO store instructions is safe only when reordering the order of these instructions is verified not to change the outcome. Otherwise, using NR.NGO stores may lead to incorrect execution. Our compiler can generate NR.NGO store instructions for store instructions that it identifies to have non-temporal behavior. For instance, a parallel loop that is detected to be non-temporal by our compiler can make use of NR.NGO instructions. At the end of such a loop, to ensure all outstanding non-globally ordered stores are completed and all threads have a consistent view of memory, our compiler generates a fence (a lock instruction) after the loop. This fence is needed before continuing execution of the subsequent code fragment to ensure all threads have exactly the same view of memory.
A general rule of thumb is that non-temporal stores benefit memory blocks that are not reused in the immediate future. So yes, reuse will be expensive in both cases.
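The bandwidth argument in the quote can be put into numbers with a toy accounting model (names and line size are mine, purely illustrative): a normal store that misses triggers read-for-ownership, so each cache line crosses the bus twice; a no-read store carries it once.

```python
CACHE_LINE = 64  # bytes per cache line (typical; KNC also uses 64)

def bus_traffic(store_bytes, no_read_hint):
    # Toy model of bus traffic for a streaming-store loop that misses
    # the cache on every line.  With RFO each line is read from memory
    # before being overwritten (2x the data); with a no-read store
    # (VMOVNRAP*/VMOVNRNGOAP*) the line is allocated without the read (1x).
    lines = -(-store_bytes // CACHE_LINE)  # ceiling division
    per_line = CACHE_LINE if no_read_hint else 2 * CACHE_LINE
    return lines * per_line
```

So streaming 1 KiB of output costs 2 KiB of bus traffic with ordinary stores but only 1 KiB with the no-read hint, which is exactly the saving the quote describes.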

Convert local memory to registers

I currently have a kernel that processes a global buffer by reading it into local memory and doing calculations. Now I would like to use registers instead of local memory. How do I convert the kernel to use registers?
Thanks!
Edit: project can be found here:
https://github.com/boxerab/ocldwt
Without seeing some code, it's impossible to give much more guidance than has already been given but I will try to elaborate on the comments.
Any variable declared without __local or __global is private, so if you remove the modifier the memory will only be visible to the single processing element running the work item. This will likely be stored in a register, although that will only happen if there is register space available. The compiler will already be putting some values into registers on your behalf, even if you haven't asked it to do so. You can see evidence of this if, for example, you are running on the NVIDIA platform and pass the -cl-nv-verbose flag when you build your kernels. You will see output like this:
ptxas info : Compiling entry function 'monte' for 'sm_35'
ptxas info : Function properties for monte
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 61 registers, 360 bytes cmem[0], 96 bytes cmem[2]
indicating that 61 registers are in use.
However, as pointed out by @DarkZeros, the decision to move from local memory to private memory is much more about the scope of variables. If your algorithm depends on all of the members of a compute unit having access to the same copy of a variable, then it isn't going to work any more.

Size of intel x86 Segment registers and GDT(LDT) Register

I'm a beginner-level student of system architecture, to be precise, Intel x86.
Currently I'm reading Intel's manuals (1, 3a, 3b, 3c) and I'm stuck on the segmentation part.
As far as I know, in protected mode the system translates a logical address into a linear (or physical) address,
and a "far pointer" points to an actual linear (or physical) memory address with two different parts:
a segment selector and an offset.
As I learned at university, each segment register holds 16 bits of data.
According to Intel's manual, the 16 bits are only the visible part of the segment register,
but there is also a hidden part of the segment register that cannot be programmed or accessed by the user.
Is there any chance I could learn the actual size of a segment register?
My second question is about the LDT, GDT, and IDT registers for protected mode.
Are those registers (LDTR, GDTR, IDTR) actual registers on the CPU chip?
If so, is there any way to access those tables after the boot sequence (privilege ring 3, user mode)?
Thank you for reading my question.
PS. I tried to google it and couldn't find any answer;
that's why I'm spending my time writing this question.
The segment registers are 16 bits. The segment descriptors that the segment registers refer to are larger. The confusing thing is that all i386 and later processors have a small non-coherent cache of segment descriptors, one cached descriptor for each segment register, that is sometimes referred to as the hidden part of the segment register. It's not really part of the register, though each entry in the cache is closely associated with a specific segment register. Whenever a segment register is written to, the corresponding cache entry is updated (re-read from memory), and instructions that use a segment register use the cached descriptor corresponding to that register rather than reading the descriptor from memory.
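To make the "hidden part" concrete: when a selector is loaded, the CPU fetches the corresponding 8-byte descriptor and caches its decoded fields (base, limit, attributes). Here is a small sketch that decodes that 8-byte layout as documented in the Intel SDM; the helper name is mine.

```python
import struct

def decode_descriptor(desc8):
    # Decode an 8-byte protected-mode segment descriptor (Intel SDM layout).
    # These are the fields the CPU caches in the hidden part of a segment
    # register when the visible 16-bit selector is written.
    lo, hi = struct.unpack("<II", desc8)
    limit = (lo & 0xFFFF) | (hi & 0x000F0000)          # 20-bit limit
    base = ((lo >> 16) & 0xFFFF) \
           | ((hi & 0xFF) << 16) \
           | (hi & 0xFF000000)                          # 32-bit base
    if (hi >> 23) & 1:            # G bit: limit counted in 4 KiB pages
        limit = (limit << 12) | 0xFFF
    return base, limit
```

For example, the classic "flat" code descriptor 0x00CF9A000000FFFF decodes to base 0 and limit 0xFFFFFFFF, covering the whole 4 GiB address space.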
The visible part of an x86 segment register is 16 bits, but the "hidden" part does exist: the Intel SDM (Vol. 3A, "Segment Registers") documents it as the descriptor cache that is reloaded whenever the selector is written.
There is a good description of the Local Descriptor Table (LDT), the Global Descriptor Table (GDT), and the Interrupt Descriptor Table (IDT) on Wikipedia: http://en.wikipedia.org/wiki/Global_Descriptor_Table.

Why is there a CL_DEVICE_MAX_WORK_GROUP_SIZE?

I'm trying to understand the architecture of OpenCL devices such as GPUs, and I fail to see why there is an explicit bound on the number of work items in a local work group, i.e. the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.
It seems to me that this should be taken care of by the compiler, i.e. if a (one-dimensional for simplicity) kernel is executed with local workgroup size 500 while its physical maximum is 100, and the kernel looks for example like this:
__kernel void test(__global float* input) {
    size_t i = get_global_id(0);
    someCode(i);
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(i);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(i);
}
then it could be converted automatically to an execution with work group size 100 on this kernel:
__kernel void test(__global float* input) {
    size_t i = get_global_id(0);
    someCode(5*i);
    someCode(5*i+1);
    someCode(5*i+2);
    someCode(5*i+3);
    someCode(5*i+4);
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(5*i);
    moreCode(5*i+1);
    moreCode(5*i+2);
    moreCode(5*i+3);
    moreCode(5*i+4);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(5*i);
    finalCode(5*i+1);
    finalCode(5*i+2);
    finalCode(5*i+3);
    finalCode(5*i+4);
}
However, it seems that this is not done by default. Why not? Is there a way to make this process automated (other than writing a pre-compiler for it myself)? Or is there an intrinsic problem which can make my method fail on certain examples (and can you give me one)?
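For what it's worth, the index arithmetic of this transformation can be sanity-checked outside OpenCL. The Python sketch below (function names are mine) confirms that 100 work items, each handling 5 consecutive elements, touch exactly the same indices as 500 single-element work items; what it cannot check is the harder part, namely whether the barrier semantics survive the rewrite.

```python
def original_indices(global_size):
    # Each work item i calls someCode(i) once.
    return sorted(range(global_size))

def coarsened_indices(num_items, factor):
    # Each work item i calls someCode(factor*i + k) for k = 0..factor-1,
    # as in the hand-coarsened kernel above.
    return sorted(factor * i + k
                  for i in range(num_items)
                  for k in range(factor))
```

Running coarsened_indices(100, 5) produces the same index set as original_indices(500), so the data coverage is preserved by the mapping.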
I think that the origin of CL_DEVICE_MAX_WORK_GROUP_SIZE lies in the underlying hardware implementation.
Multiple threads run simultaneously on the compute units, and every one of them needs to keep state (for call, jmp, etc.). Most implementations use a stack for this, and if you look at the AMD Evergreen family, there is a hardware limit on the number of available stack entries (every stack entry has subentries). This in essence limits the number of threads every compute unit can handle simultaneously.
As for whether the compiler could do this to make it possible: it could work, but understand that it would mean recompiling the kernel, which isn't always possible. I can imagine situations where developers ship the compiled kernel for each platform in a binary format with their software, for "not so open-source" reasons.
Those constants are queried from the device by the compiler in order to determine a suitable work group size at compile time (where compiling, of course, refers to compiling the kernel). I might be misreading you, but it seems you're thinking of setting those values yourself, which is not the case.
The responsibility is within your code: query the system's capabilities and be prepared for whatever hardware it will run on.
