What is the need of a temporary register for arithmetic operations in the 8085 microprocessor?

I know that the two inputs of the ALU are the accumulator and the temporary register. I was a bit curious: why do we store the second operand in a temporary register rather than connecting the data bus directly to the ALU? What is the need for that temporary register in the 8085 microprocessor?

One of the uses of a temporary register is to stabilize the timing of the data flow.
Consider the example you hint at in your question: An ALU operation with the Accumulator and a memory operand.
Yes, the memory operand is read from memory. So why not feed this directly into the ALU?
One possible answer is timing.
The data being read from memory is only valid at the end of the bus cycle. This data goes away when the bus cycle ends. This means that the data is only known to be valid for a very brief period of time.
In all likelihood, too short a period of time to propagate through the ALU and be stored in the Accumulator.
However, more than enough time to simply be stored directly into a temporary register. The ALU then has two stable inputs: The Accumulator and the Temporary register.
In effect, the temporary register holds the result of the memory operation for use by the ALU in the next processor cycle.

Related

Why are the Motorola 68k's 32-bit general-purpose registers divided into data registers and address registers?

The 68k registers are divided into two groups of eight: eight data registers (D0 to D7) and eight address registers (A0 to A7). What is the purpose of this separation? Would it not be better if they were unified?
The short answer is, this separation comes from the architecture limitations and design decisions made at the time.
The long answer:
The M68K implements quite a lot of addressing modes (especially when compared with the RISC-based processors), with many of its instructions supporting most (if not all) of them. This gives a large variety of addressing modes combinations within every instruction.
This also adds complexity in terms of opcode execution. Take the following example:
move.l $10(pc), -$20(a0,d0.l)
The instruction is just to copy a long-word from one location to another, simple enough. But in order to actually perform the operation, the processor needs to figure out the actual (raw) memory addresses to work with for both source and destination operands. This process, in which operands addressing modes are decoded (resolved), is called the effective address calculation.
For this example:
In order to calculate the source effective address, $10(pc), the processor loads the value of the PC (program counter) register and adds $10 to it.
In order to calculate the destination effective address, -$20(a0,d0.l), the processor loads the value of the A0 register, adds the value of the D0 register to it, then subtracts $20.
This is quite a lot of calculation for a single opcode, isn't it?
But the M68K is quite fast in performing these calculations. In order to calculate effective addresses quickly, it implements a dedicated Address Unit (AU).
As a general rule, operations on data registers are handled by the ALU (Arithmetic Logical Unit) and operations involving address calculations are handled by the AU (Address Unit).
The AU is well optimized for 32-bit address operations: it performs a 32-bit addition/subtraction within one bus cycle (4 CPU ticks), which the ALU doesn't (it takes 2 bus cycles for 32-bit operations).
However, the AU is limited to loads and basic addition/subtraction (as dictated by the addressing modes), and it is not connected to the CCR (Condition Code Register), which is why operations on address registers never update flags.
That said, the AU was there to optimize the calculation of complex addressing modes, but it couldn't replace the ALU completely (after all, there were only about 68K transistors in the M68K); hence there are two register sets (data and address registers), each with its own dedicated unit.
A further note, based on a quick lookup: 16 interchangeable registers would obviously be easier to program. The problem is that each register field in an instruction would then need an extra bit to select among 16 registers instead of 8, doubling the encoding space consumed. Using half of the registers for each purpose is not ideal, but it gives access to more registers overall within the same opcode budget.

_mm512_storenr_pd and _mm512_storenrngo_pd

What is the difference between _mm512_storenrngo_pd and _mm512_storenr_pd?
_mm512_storenr_pd(void * mt, __m512d v):
Stores packed double-precision (64-bit) floating-point elements from v
to memory address mt with a no-read hint to the processor.
It is not clear to me what the no-read hint means. Does it mean that it is a non-cache-coherent write? Does it mean that a reuse is more expensive, or not coherent?
_mm512_storenrngo_pd(void * mt, __m512d v):
Stores packed double-precision (64-bit) floating-point elements from v
to memory address mt with a no-read hint and using a weakly-ordered
memory consistency model (stores performed with this function are not
globally ordered, and subsequent stores from the same thread can be
observed before them).
Basically the same as storenr_pd, but since it uses a weak consistency model, a thread can observe its own later writes before these stores become visible. Does that make access from another processor non-coherent, or just more expensive?
Quote from Intel® Xeon Phi™ Coprocessor Vector Microarchitecture:
In general, in order to write to a cache line, the Xeon Phi™ coprocessor needs to read in a cache line before writing to it. This is known as read for ownership (RFO). One problem with this implementation is that the written data is not reused; we unnecessarily take up the BW for reading non-temporal data. The Intel® Xeon Phi™ coprocessor supports instructions that do not read in data if the data is a streaming store. These instructions, VMOVNRAP*, VMOVNRNGOAP* allow one to indicate that the data needs to be written without reading the data first. In the Xeon Phi ISA the VMOVNRAPS/VMOVNRPD instructions are able to optimize the memory BW in case of a cache miss by not going through the unnecessary read step.
The VMOVNRNGOAP* instructions are useful when the programmer tolerates weak write-ordering of the application data―that is, the stores performed by these instructions are not globally ordered. This means that the subsequent write by the same thread can be observed before the VMOVNRNGOAP instructions are executed. A memory-fencing operation should be used in conjunction with this operation if multiple threads are reading and writing to the same location.
It seems that "No-read hints", "Streaming store", and "Non-temporal Stream/Store" are used interchangeably in several resources.
So yes, it is a write that skips the coherence read (the RFO); note, though, that on Knights Corner (KNC, where both vmovnrap* and vmovnrngoap* belong) the stores still go to the L2 cache, so they do not bypass all levels of cache.
As the quote above explains, vmovnrngoap* differs from vmovnrap* in that its weakly-ordered memory consistency model allows a "subsequent write by the same thread [to] be observed before the VMOVNRNGOAP instructions are executed"; so yes, another thread or processor may observe the stores out of order, and a fencing operation should be used. Though CPUID can be used as the fence, better options are "LOCK ADD [RSP],0" (a dummy atomic add) or XCHG (which combines a store and a fence).
A few more details:
On KNC if you use compiler switch (-opt-streaming-stores always) or pragma (#pragma vector nontemporal), the default generated code will be VMOVNRNGOAP* starting with Composer XE 2013 Update 1;
More quotes from COMPILER-BASED MEMORY OPTIMIZATIONS FOR HIGH PERFORMANCE COMPUTING SYSTEMS
NR Stores. The NR store instruction (vmovnr) is a standard vector store instruction that can always be used safely. An NR store instruction that misses in the local cache causes all potential copies of the cache line in remote caches to be invalidated, the cache line to be allocated (but not initialized) at the local cache in exclusive state, and the write-data in the instruction to be written to the cache line. There is no data transfer from main memory, which is what saves memory bandwidth. An NR store instruction and other load and/or store instructions from the same thread are globally ordered, which means that all observers of this sequence of instructions always see the same fixed execution order.
The NR.NGO (non-globally ordered) store instruction (vmovnrngo) relaxes the global ordering constraint of the NR store instruction. This relaxation gives the NR.NGO instruction a lower latency than the NR instruction, which can be used to achieve higher performance in streaming-store-intensive applications. However, removing this restriction means that an NR.NGO store instruction and other load and/or store instructions from the same thread can be observed by two observers to have two different orderings. The use of NR.NGO store instructions is safe only when reordering these instructions is verified not to change the outcome. Otherwise, using NR.NGO stores may lead to incorrect execution. Our compiler can generate NR.NGO store instructions for store instructions that it identifies to have non-temporal behavior. For instance, a parallel loop that is detected to be non-temporal by our compiler can make use of NR.NGO instructions. At the end of such a loop, to ensure all outstanding non-globally ordered stores are completed and all threads have a consistent view of memory, our compiler generates a fence (a lock instruction) after the loop. This fence is needed before continuing execution of the subsequent code fragment to ensure all threads have exactly the same view of memory.
A general rule of thumb is that non-temporal stores benefit memory access patterns that are not reused in the near future. So yes, reuse will be expensive in both cases.

Winsock asynchronous multiple WSASend with one single buffer

MSDN states "For a Winsock application, once the WSASend function is called, the system owns these buffers and the application may not access them."
In a server application, does that mean that if I want to broadcast a message to multiple clients I cannot use a single buffer that holds the data and invoke WSASend on each socket with that one buffer?
I don't have a documentation reference that confirms this is possible but I've been doing it for years and it hasn't failed yet, YMMV.
You CAN use a single data buffer as long as you have a unique OVERLAPPED structure per send. Since the WSABUF array is duplicated by the WSASend() call and can be stack based I would expect that you COULD have a single WSABUF array, but I've never done that.
What you DO need to make sure of is that you keep that single data buffer "alive" until all of the sends complete.
Broadcasting like this can complicate a design if you tend to structure your extended OVERLAPPED so that it includes the data buffer, but it does avoid memory allocation and memory copying.
Note: I have a system whereby my extended OVERLAPPED structures include the data buffer and operation code; these are reference counted, pooled, and used for both sends and recvs. When broadcasting a buffer I use a separate "buffer handle" per send. This handle is just an OVERLAPPED structure extended in a different way: it holds a reference to the original data buffer and has its own reference count. When all of the broadcast sends have completed, all of the buffer handles will have been released, and these will, in turn, have released the underlying data buffer for reuse.

How malloc() and sbrk() works in unix?

I am new to UNIX, and I am studying some of the UNIX system calls, such as brk(), sbrk(), and so on.
The other day I read about the malloc() function, and I was a little confused!
Can anybody tell me why malloc reduces the number of sbrk() system calls that the program must perform?
And another question: do brk(0), sbrk(0) and malloc(0) return the same value?
Syscalls are expensive because of the overhead they incur: you have to switch to kernel mode. A system call enters the kernel by issuing a "trap" or interrupt. It is a request to the kernel for a service, and because it executes in the kernel address space, it carries the high overhead of switching to the kernel (and then switching back).
This is why malloc reduces the number of calls to sbrk() and brk(): it requests more memory from the kernel than you asked for, so that it doesn't have to issue a syscall every time your program needs more memory.
brk() and sbrk() are different.
brk is used to set the end of the data segment to the value you specify. It says "set the end of my data segment to this address". Of course, the address you specify must be reasonable, the operating system must have enough memory, and you can't make it point to somewhere that would otherwise exceed the process maximum data size. Thus, brk(0) is invalid, since you'd be trying to set the end of the data segment to address 0, which is nonsense.
On the other hand, sbrk increments the data segment size by the amount you specify, and returns a pointer to the previous break value. Calling sbrk with 0 is valid; it is a way to get a pointer to the current data segment break address.
malloc is not a system call, it's a C library function that manages memory using sbrk. According to the manpage, malloc(0) is valid, but not of much use:
If size is 0, then malloc() returns either NULL, or a unique pointer
value that can later be successfully passed to free().
So, no, brk(0), sbrk(0) and malloc(0) are not equivalent: the first of them is invalid, the second is used to obtain the address of the program's break, and the latter is useless.
Keep in mind that you should never use both malloc and brk or sbrk throughout your program. malloc assumes it's got full control of brk and sbrk, if you interchange calls to malloc and brk, very weird things can happen.
why malloc reduces the number of sbrk() system calls that the program must perform?
Say you call malloc() to request 10 bytes of memory. The implementation may use sbrk (or another system call like mmap) to request 4K bytes from the OS. Then, when you call malloc() the next time to request another 10 bytes, it doesn't have to issue a system call; it can simply hand out part of the 4K it obtained the last time.
malloc() is a library function, declared in stdlib.h, that obtains memory from the operating system (via system calls such as sbrk) and manages it for the process dynamically.
With sbrk you have to request memory from the kernel explicitly yourself; malloc hides that behind a simple interface.
You pass malloc the size you need, and it returns a pointer to a block of at least that size, which you store in a variable.
sbrk() function increases the programs data segment allocation by specified bytes.
void *p = malloc(4096); // sbrk += 4096 bytes
free(p);                // freeing memory will not bring the break back down by 4096 bytes
p = malloc(4096);       // malloc'ing again will not grow the break; it reuses the existing space, so no sbrk() call results

Memory test operation without pointers in NXC on NXT?

I'm trying to write a memory test program for the NXT, since I have several with burned memory cells and would like to identify which NXTs are unusable. This program is intended to test each byte in memory for integrity by:
Allocating 64 bits to a Linear Feedback Shift Register randomizer
Advancing a memory pointer one byte at a time
Writing random data to the selected memory cell
Verifying the data is read back correctly
However, I then discovered through these attempts that the NXT doesn't actually support pointer operations. Thus, I can't simply iterate the pointer byte and read its location to test.
How do I go about iterating over indexes in memory without pointers?
I think the problem is that you don't really get direct memory access in either NBC/NXC or RobotC.
From what I know, both run on an NXT firmware emulator; so the bad memory address[es] might change from your program's point of view (assuming the emulator does virtual memory).
To actually run on bare metal, I would suggest using the NXTBINARY function of John Hansen's modified firmware, as described here:
http://www.tau.ac.il/~stoledo/lego/nxt-native/
The enhanced firmware can be found at:
http://bricxcc.sourceforge.net/test_releases/
