What is the difference between _mm512_storenrngo_pd and _mm512_storenr_pd?
_mm512_storenr_pd(void * mt, __m512d v):
Stores packed double-precision (64-bit) floating-point elements from v
to memory address mt with a no-read hint to the processor.
It is not clear to me what a no-read hint means. Does it mean that it is a non-cache-coherent write? Does it mean that a reuse is more expensive, or not coherent?
_mm512_storenrngo_pd(void * mt, __m512d v):
Stores packed double-precision (64-bit) floating-point elements from v
to memory address mt with a no-read hint and using a weakly-ordered
memory consistency model (stores performed with this function are not
globally ordered, and subsequent stores from the same thread can be
observed before them).
Basically the same as storenr_pd, but since it uses a weak consistency model, a thread's subsequent stores can be observed before the stores performed by this function. But is access by another processor non-coherent or more expensive?
Quote from Intel® Xeon Phi™ Coprocessor Vector Microarchitecture:
In general, in order to write to a cache line, the Xeon Phi™ coprocessor needs to read in a cache line before writing to it. This is known as read for ownership (RFO). One problem with this implementation is that the written data is not reused; we unnecessarily take up the BW for reading non-temporal data. The Intel® Xeon Phi™ coprocessor supports instructions that do not read in data if the data is a streaming store. These instructions, VMOVNRAP*, VMOVNRNGOAP* allow one to indicate that the data needs to be written without reading the data first. In the Xeon Phi ISA the VMOVNRAPS/VMOVNRPD instructions are able to optimize the memory BW in case of a cache miss by not going through the unnecessary read step.
The VMOVNRNGOAP* instructions are useful when the programmer tolerates weak write-ordering of the application data―that is, the stores performed by these instructions are not globally ordered. This means that the subsequent write by the same thread can be observed before the VMOVNRNGOAP instructions are executed. A memory-fencing operation should be used in conjunction with this operation if multiple threads are reading and writing to the same location.
It seems that "No-read hints", "Streaming store", and "Non-temporal Stream/Store" are used interchangeably in several resources.
So yes, it is a non-cache-coherent write, though on Knights Corner (KNC, where both vmovnrap* and vmovnrngoap* belong) the stores go to the L2 cache; they do not bypass all levels of cache.
As explained in the quote above, vmovnrngoap* differs from vmovnrap* in that its weakly-ordered memory consistency model allows a "subsequent write by the same thread [to] be observed before the VMOVNRNGOAP instructions are executed". So yes, access by another thread or processor is non-coherent, and a fencing operation should be used. Though CPUID can be used as the fencing operation, better options are "LOCK ADD [RSP],0" (a dummy atomic add) or XCHG (which combines a store and a fence).
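To make the fencing requirement concrete, here is a minimal sketch of my own (not from the quoted sources), assuming the Intel compiler targeting KNC (icc -mmic), a 64-byte-aligned destination, and an element count that is a multiple of 8:

#include <immintrin.h>
#include <stddef.h>

/* Fill dst with a constant using no-read, non-globally-ordered stores.
   dst is assumed to be 64-byte aligned and n a multiple of 8. */
void stream_fill(double *dst, double value, size_t n)
{
    __m512d v = _mm512_set1_pd(value);
    for (size_t i = 0; i < n; i += 8)
        _mm512_storenrngo_pd(&dst[i], v);   /* weakly ordered streaming store */

    /* Fence before other threads read dst: a dummy locked add, as suggested
       above, orders the NGO stores with respect to later memory operations. */
    __asm__ __volatile__("lock; addl $0,(%%rsp)" ::: "memory", "cc");
}

With _mm512_storenr_pd instead, the fence would not be needed for ordering, since NR stores stay globally ordered.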
A few more details:
On KNC, if you use the compiler switch (-opt-streaming-stores always) or the pragma (#pragma vector nontemporal), the generated code will default to VMOVNRNGOAP* starting with Composer XE 2013 Update 1 (a sketch of the pragma form follows the quotes below).
More quotes from COMPILER-BASED MEMORY OPTIMIZATIONS FOR HIGH PERFORMANCE COMPUTING SYSTEMS:
NR Stores. The NR store instruction (vmovnr) is a standard vector store instruction that can always be used safely. An NR store instruction that misses in the local cache causes all potential copies of the cache line in remote caches to be invalidated, the cache line to be allocated (but not initialized) at the local cache in exclusive state, and the write-data in the instruction to be written to the cache line. There is no data transfer from main memory, which is what saves memory bandwidth. An NR store instruction and other load and/or store instructions from the same thread are globally ordered, which means that all observers of this sequence of instructions always see the same fixed execution order.
The NR.NGO (non-globally ordered) store instruction (vmovnrngo) relaxes the global ordering constraint of the NR store instruction. This relaxation makes the NR.NGO instruction have a lower latency than the NR instruction, which can be used to achieve higher performance in streaming store intensive applications. However, removing this restriction means that an NR.NGO store instruction and other load and/or store instructions from the same thread can be observed by two observers to have two different orderings. The use of NR.NGO store instructions is safe only when reordering the order of these instructions is verified not to change the outcome. Otherwise, using NR.NGO stores may lead to incorrect execution. Our compiler can generate NR.NGO store instructions for store instructions that it identifies to have non-temporal behavior. For instance, a parallel loop that is detected to be non-temporal by our compiler can make use of NR.NGO instructions. At the end of such a loop, to ensure all outstanding non-globally ordered stores are completed and all threads have a consistent view of memory, our compiler generates a fence (a lock instruction) after the loop. This fence is needed before continuing execution of the subsequent code fragment to ensure all threads have exactly the same view of memory.
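To see what that compiler-driven path looks like in source, here is a sketch of the pragma form mentioned above (my own example, assuming icc targeting KNC):

/* Ask the compiler to use streaming (non-temporal) stores for this loop;
   on KNC with Composer XE 2013 Update 1 or later this should generate
   VMOVNRNGOAP* stores (see the quote above about the fence the compiler
   adds after such loops). */
void copy_nt(double *restrict dst, const double *restrict src, int n)
{
    #pragma vector nontemporal
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}

With -opt-streaming-stores always, the compiler would attempt this for all eligible loops even without the pragma.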
A general rule of thumb is that non-temporal stores benefit memory access patterns that are not reused in the immediate future. So yes, reuse will be expensive in both cases.
I am learning CUDA, but I don't have access to a CUDA device yet and am curious about some unified memory behaviour. As far as I understand, unified memory transfers data between host and device on a need-to-know basis. So if the CPU accesses some data 100 times that currently resides on the GPU, it transfers the data only during the first attempt and clears that memory space on the GPU. (Is my interpretation correct so far?)
1. Assuming this, if the data structures meant to fit on the GPU are too large for the device memory, will UM swap out some recently accessed data structures to make space for the next ones needed to complete the computation, or does this still have to be done manually?
2. Additionally, I would be grateful if you could clarify something else related to the memory transfer behaviour. It seems obvious that data would be transferred back and forth upon access of the actual data, but what about accessing the pointer itself? For example, if I had 2 arrays of the same UM pointers (the data behind the pointers is currently on the GPU and the following code is executed from the CPU) and were to slice the first array, maybe to delete an element, would iterating over the pointers while placing them into a new array count as accessing the data and so trigger a memory transfer? Surely not.
As far as I understand, unified memory transfers data between host and device on a need-to-know basis. So if the CPU accesses some data 100 times that currently resides on the GPU, it transfers the data only during the first attempt and clears that memory space on the GPU. (Is my interpretation correct so far?)
The first part is correct: when the CPU tries to access a page that resides in device memory, it is transferred to main memory transparently. What happens to the page in device memory is probably an implementation detail, but I imagine it may not be cleared. After all, its contents only need to be refreshed if the CPU writes to the page and if it is accessed by the device again. Better ask someone from NVIDIA, I suppose.
Assuming this, if the data structures meant to fit on the GPU are too large for the device memory, will UM swap out some recently accessed data structures to make space for the next ones needed to complete the computation, or does this still have to be done manually?
Before CUDA 8, no, you could not allocate more (oversubscribe) than what could fit on the device. Since CUDA 8, it is possible: pages are faulted in and out of device memory (probably using an LRU policy, but I am not sure whether that is specified anywhere), which allows one to process datasets that would not fit on the device and would otherwise require manual streaming.
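As a rough host-side illustration (my own sketch, not from the CUDA documentation; it assumes CUDA 8 or later, a Pascal-or-newer GPU, and Linux), you can request more managed memory than the device has and touch it from the CPU:

/* Compile as C and link against the CUDA runtime, e.g.
   gcc um_oversub.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);              /* physical device memory */

    /* Request 1.5x the device's memory; with demand paging this succeeds
       and pages migrate between host and device as they are touched. */
    size_t n = total_b + total_b / 2;
    char *buf = NULL;
    cudaError_t err = cudaMallocManaged((void **)&buf, n, cudaMemAttachGlobal);
    if (err != cudaSuccess) {
        printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Touching the buffer from the CPU faults pages into host memory;
       a kernel accessing the same pages would fault them back to the GPU. */
    for (size_t i = 0; i < n; i += 4096)
        buf[i] = 1;

    cudaFree(buf);
    return 0;
}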
It seems obvious that data would be transferred back and forth upon access of the actual data, but what about accessing the pointer itself?
It works exactly the same. It makes no difference whether you're dereferencing the pointer that was returned by cudaMalloc (or even malloc), or some pointer within that data. The driver handles it identically.
I want to implement some file I/O with the routines provided by MPI (in particular Open MPI).
Due to possible limitations of the environment, I wondered whether it is possible to limit the nodes that are responsible for I/O, so that all other nodes perform a hidden mpi_send to this group of processes, which then actually writes the data. This would be nice in cases where, e.g., the master process runs on a node with a high-performance filesystem while the other nodes only have access to a low-performance filesystem where the binaries are stored.
Actually, I already found some information that might be helpful, but I couldn't find out how to actually implement these things:
1: There is the MPI_IO attribute attached to the communicator, which tells which ranks provide standard-conforming I/O routines. As this is listed as an environmental inquiry, I don't see where I could modify it.
2: There is an info key io_nodes_list which seems to belong to file-related info objects. Unfortunately, the possible values for this key are not documented, and Open MPI doesn't seem to implement it in any way. Actually, I can't even get the filename from the info object which is returned by mpi_file_get_info...
As a workaround, I could imagine two things: on the one hand, perform the I/O with standard Fortran routines; on the other hand, create a new communicator that is responsible for I/O. But in both cases, the processes responsible for I/O would have to check for possible I/O requests from the other processes and perform the communication and file interaction manually.
Is there a nice and automatic way to restrict the I/O to certain nodes? If yes, how could I implement this?
You explicitly asked about OpenMPI, but there are two MPI-IO implementations in OpenMPI. The old workhorse is ROMIO, the MPI-IO implementation shared among just about every MPI implementation. OpenMPI also has OMPIO, but I don't know a whole lot about tuning that one.
Next, if you want things to happen automatically for you, you'll have to use collective i/o. The independent I/O routines cannot send a message to anyone else -- they are independent and there's no way to know if the other side will be listening.
With those preliminaries out of the way...
You are asking about "i/o aggregation". There is a bit of information here in the context of another optimization called "deferred open" (which OMPIO calls Lazy Open):
https://press3.mcs.anl.gov/romio/2003/08/05/deferred-open/
In short, you can definitely say "only these N processes should do I/O", and then the collective I/O library will exchange data and make sure that happens. The optimization was developed some 15-odd years ago for just the situation you proposed: some nodes being better connected to storage than others (as was the case on the old ASCI Red machine, to give you a sense for how old this optimization is...)
I don't know where you got io_nodes_list. You probably want to use the MPI-IO info keys cb_config_list and cb_nodes.
So, you've got a cluster with master1, master2, master3, and compute1, compute2, compute3 (or whatever the hostnames actually are). You can do something like this (in C, sorry, I'm not proficient in Fortran):
MPI_Info info;
MPI_File fh;

MPI_Info_create(&info);
/* Aggregate collective I/O through one process on each of these hosts. */
MPI_Info_set(info, "cb_config_list", "master1:1,master2:1,master3:1");
MPI_File_open(MPI_COMM_WORLD, filename,
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
With these hints, MPI_File_write_all will aggregate all the I/O through the MPI processes on master1, master2, and master3. ROMIO won't blow up your memory because it will chunk up the I/O into a smaller working set (specified with the "cb_buffer_size" hint: cranking this up, if you have the memory, is a good way to get better performance).
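If it helps to see it end to end, here is a small self-contained sketch of my own (placeholder hostnames and output path) combining those hints with a collective write:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Aggregate through one process per "master" host; 16 MiB collective
       buffer per aggregator. */
    MPI_Info_set(info, "cb_config_list", "master1:1,master2:1,master3:1");
    MPI_Info_set(info, "cb_buffer_size", "16777216");

    double chunk[1024];
    for (int i = 0; i < 1024; i++)
        chunk[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* Each rank writes its chunk at a rank-dependent offset. */
    MPI_File_set_view(fh, (MPI_Offset)rank * (MPI_Offset)sizeof(chunk),
                      MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, chunk, 1024, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}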
There is a ton of information about the hints you can set in the ROMIO users guide:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide/node6.html
I wasn't able to find an answer to my question anywhere on the web, so I thought Stack Overflow would be my best bet! My question simply is: is it possible to build a computer with no registers? I know registers are temporary data holders and provide the fastest possible way to access data, but what are the consequences of the absence of registers in a computer, besides making data transfer a lot slower?
No. You can have a model of computation that doesn't involve registers. In fact, most theoretical models don't.
But as for a CPU, which is an electrical circuit, any kind of persistent state is implemented by a flip-flop, a.k.a. a register. There is no way to feed data into the circuits that perform processing without connecting a register to their inputs.
As for practical models of computation, i.e. instruction set architectures, you can avoid the terminology of calling anything a "register", but you inevitably need to define some means of data storage upon which operations act. Even if you don't, programmers will end up calling such storage locations registers. Some old machines used the first page of RAM as primary scratch space, so the first 256 bytes were dubbed "registers", even if they were DRAM in the electronic sense. (Memory fails me; this could have been before DRAM. There is no difference between SRAM and what is physically called a register.)
I'm trying to write a memory test program for the NXT, since I have several with burned memory cells and would like to identify which NXTs are unusable. This program is intended to test each byte in memory for integrity by:
Allocating 64 bits to a Linear Feedback Shift Register randomizer
Advancing a memory pointer by one byte
Writing random data to the selected memory cell
Verifying the data is read back correctly
However, I then discovered through these attempts that the NXT doesn't actually support pointer operations. Thus, I can't simply increment the pointer and read back the location it points to in order to test it.
How do I go about iterating over indexes in memory without pointers?
I think the problem is that you don't really get direct memory access in either NBC/NXC or RobotC.
From what I know, both run on an NXT firmware emulator, so the bad memory address[es] might change from your program's point of view (assuming the emulator does virtual memory).
To actually run bare metal, I would suggest using the NXTBINARY function of John Hansen's modified firmware, as described here:
http://www.tau.ac.il/~stoledo/lego/nxt-native/
The enhanced firmware can be found at:
http://bricxcc.sourceforge.net/test_releases/
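If you do run native code that way, the byte-by-byte test described in the question might look roughly like the following (a hypothetical sketch of mine; the region boundaries and the LFSR taps are placeholders, not NXT specifics):

#include <stdint.h>

static uint64_t lfsr = 0xACE1ACE1ACE1ACE1ULL;   /* 64-bit LFSR state */

/* One Galois LFSR step; returns the low byte of the new state. */
static uint8_t next_random(void)
{
    uint64_t lsb = lfsr & 1u;
    lfsr >>= 1;
    if (lsb)
        lfsr ^= 0xD800000000000000ULL;          /* taps 64,63,61,60 */
    return (uint8_t)lfsr;
}

/* Walk the region one byte at a time: write a random value, read it back,
   and report the first mismatching cell (returns 0 if all bytes pass). */
volatile uint8_t *test_region(volatile uint8_t *start, volatile uint8_t *end)
{
    for (volatile uint8_t *p = start; p < end; p++) {
        uint8_t value = next_random();
        *p = value;
        if (*p != value)
            return p;
    }
    return (volatile uint8_t *)0;
}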
I was reading this excellent article, which gives an introduction to asynchronous programming, here: http://krondo.com/blog/?p=1209 and I came across the following line, which I find hard to understand.
Since there is no actual parallelism (in async), it appears from our diagrams that an asynchronous program will take just as long to execute as a synchronous one, perhaps longer as the asynchronous program might exhibit poorer locality of reference.
Could someone explain how locality of reference comes into the picture here?
Locality of reference, like that Wikipedia article mentions, is the observation that when some data is accessed (on disk, in memory, whatever), other data near that location is often accessed as well. This observation makes sense since developers tend to group similar data together. Since the data are related, they're often processed together. Specifically, this is known as spatial locality.
For a weak example, imagine computing the sum of an array or doing a matrix multiplication. The data representing the array or matrix are typically stored in contiguous memory locations, and for this example, once you access one specific location in memory, you will be accessing others close to it as well.
Computer architecture takes locality of reference into account. Operating systems have the notion of "pages" which are (roughly) 4KB chunks of data that can be paged in and out individually (moved between physical memory and disk). When you touch some memory that's not resident (not physically in RAM), the OS will bring the entire page of data off disk and into memory. The reason for this is locality: you're likely to touch other data around what you just touched.
Additionally, CPUs have the concept of caches. For example, a CPU might have an L1 (level 1) cache, which is really just a big block of on-CPU data that the CPU can access faster than RAM. If a value is in the L1 cache, the CPU will use that instead of going out to RAM. Following the principle of locality of reference, when a CPU accesses some value in main memory, it will bring that value and all values near it into the L1 cache. This set of values is known as a cache line. Cache lines vary in size, but the point is that when you access the first value of an array, the CPU might have to get it from RAM, but subsequent accesses (close in proximity) will be faster since the CPU brought the whole bundle of values into the L1 cache on the first access.
So, to answer your question: if you imagine a synchronous process computing the sum of a very large array, it will touch memory locations in order one after the other. In this case, your locality is good. In the asynchronous case, however, you might have n threads each taking a slice of the array (of size 1/n) and computing the sub-sum. Each thread is touching a potentially very different location in memory (since the array is large) and since each thread can be switched in and out of execution, the actual pattern of data access from the point of view of the OS or CPU is poor. The L1 cache on a CPU is finite, so if Thread 1 brings in a cache line (due to an access), this might evict the cache line of Thread 2. Then, when Thread 2 goes to access its array value, it has to go to RAM, which will bring in its cache line again and potentially evict the cache line of Thread 1, and so on. Depending on the system resources and usage as a whole, this pattern could happen on the OS/page level as well.
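To make that concrete, here is a small sketch of my own (not from the article) of the thread-per-slice sum described above; each thread's accesses are sequential within its slice, but the slices compete for the same caches when threads are switched in and out:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N (1 << 22)

static double data[N];
static double partial[NTHREADS];

/* Each thread sums one contiguous slice of the array. */
static void *sum_slice(void *arg)
{
    long t = (long)arg;
    size_t begin = (size_t)t * (N / NTHREADS);
    size_t end = begin + (N / NTHREADS);
    double s = 0.0;
    for (size_t i = begin; i < end; i++)
        s += data[i];
    partial[t] = s;
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, sum_slice, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partial[t];
    }
    printf("sum = %f\n", total);   /* build with -pthread */
    return 0;
}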
The poorer locality of reference results in poorer cache usage -- each time you do a thread switch, you can expect that most of what's in the cache relates to that previous thread, not the current one, so most reads will get data from main memory instead of the cache.
He's ultimately wrong though, at least for quite a few programs. The reason is pretty simple: even though you gain nothing on CPU-bound code, when you can combine some CPU-bound code with some I/O bound code, you can expect an overall speed improvement. You can, for example, initiate a read or write, then switch to doing computation while the disk is busy, then switch back to the I/O bound thread when the disk finishes its work.