Computer with no registers - cpu-registers

I wasn't able to find an answer to my question anywhere on the web, so I thought stackoverflow would be my best bet! My question simply is, is it possible to establish a computer with no registers? I know registers are temp. data holders and provice the fastest way possible to access data, but what are the consequences to the inexistence of registers in a computer, besides making data transmission a lot slower?

No. You can have a model of computation that doesn't involve registers. In fact, most theoretical models don't.
But as for a CPU, which is an electrical circuit, any kind of persistent state is implemented by a flip-flop, a.k.a. a register. There is no way to feed data into the circuits that perform processing without connecting a register to their inputs.
As for practical models of computation, i.e. instruction set architectures, you can avoid the terminology of calling anything a "register" but you inevitably need to define some means of data storage upon which operations act. Even if you don't, programmers will end up calling such storage locations as registers. Some old machines used the first page of RAM as primary scratch space, therefore the first 256 bytes were dubbed "registers," even if they were DRAM in the electronic sense. (Memory fails me; this could have been before DRAM. There is no difference between SRAM and what is physically called a register.)

Related

What is lockstep in Peer-to-Peer gaming?

I am researching about Peer-To-Peer network architecture for games.
What i have read from multiples sources is that Peer-To-Peer model makes it easy for people to hack. Sending incorrect data about your game character, whether it is your wrong position or the amount of health point you have.
Now I have read that one of the things to make Peer-To-Peer more secure is to put an anti-cheat system into your game, which controls some thing like: how fast has someone moved from spot A to spot B, or controls if someones health points did not change drastically without a reason.
I have also read about Lockstep, which is described as a "handshake" between all the clients in Peer-to-Peer network, where clients promise not to do certain things, for instance "move faster than X or not to be able to jump higher than Y" and then their actions are compared to the rules set in the "handshake".
To me this seems like an anti-cheat system.
What I am asking in the end is: What is Lockstep in Peer-To-Peer model, is it an Anti-Cheat system or something else and where should this system be placed in Peer-To-Peer. In every players computer or could it work if it is not in all of the players computer, should this system control the whole game, or only a subset?
Lockstep was designed primarily to save on bandwidth (in the days before broadband).
Question: How can you simulate (tens of) thousands of units, distributed across multiple systems, when you have only a vanishingly small amount of bandwidth (14400-28800 baud)?
What you can't do: Send tens of thousands of positions or deltas, every tick, across the network.
What you can do: Send only the inputs that each player makes, for example, "Player A orders this (limited size) group ID=3 of selected units to go to x=12,y=207".
However, the onus of responsibility now falls on each client application (or rather, on developers of P2P client code) to transform those inputs into exactly the same gamestate per every logic tick. Otherwise you get synchronisation errors and simulation failure, since no peer is authoritative. These sync errors can result from a great deal more than just cheaters, i.e. they can arise in many legitimate, non-cheating scenarios (and indeed, when I was a young man in the '90s playing lockstepped games, this was a frequent frustration even over LAN, which should be reliable).
So now you are using only a tiny fraction of the bandwidth. But the meticulous coding required to be certain that clients do not produce desync conditions makes this a lot harder to code than an authoritative server, where non-sane inputs or gamestate can be discarded by the server.
Cheating: It is easy to see things you shouldn't be able to see: every client has all the simulation data available. It is very hard to modify the gamestate without immediately crashing the game.
I've accidentally stumbled across this question in google search results, and thought I might as well answer years later. For future generations, you know :)
Lockstep is not an anti-cheat system, it is one of the common p2p network models used to implement online multiplayer in games (most notably in strategy games). The base concept is fairly straightforward:
The game simulation is split into fairly short time frames.
After each frame players collect input commands from that frame and send those commands over the network
Once all the players receive the commands from all the other players, they apply them to their local game simulation during the next time frame.
If simulation is deterministic (as it should be for lockstep to work), after applying the commands all the players will have the same world state. Implementing the determinism right is arguably the hardest part, especially for cross-platform games.
Being a p2p model lockstep is inherently weak to cheaters, since there is no agent in the network that can be fully trusted. As opposed to, for example, server-authoritative network models, where developer can trust a privately-owned server that hosts the game. Lockstep does not offer any special protection against cheaters by itself, but it can certainly be designed to be less (or more) vulnerable to cheating.
Here is an old but still good write-up on lockstep model used in Age of Empires series if anyone needs a concrete example.

what are the advantages of implementing register in microcontroller architectures i.e. load store architecture

Major difference in RISC and CISC is that in RISC we must need to use registers to do any arithmetic or logic operation. But in case of CISC we can do such operation directly with memory locations. So what is the advantage of implementing register banking in micro controller architectures? Question is not for the advantage of RISC but the question is for what is need of register in RISC architecture. As in other architecture CISC operation can be done directly with meomery location we don't need to take it in register and then again move into the memory location. Below is the example:
CISC: MUL A,B
RISC:
LDA R0,A
LDA R1,B
MUL R0,R1
STR A,R0
So in above example what is the advantage of using R0 and R1 ie. registers. what is the advantage of load store architecture?
Register banking is something else, I assume you are simply asking about using a register directly or not. Well the memory access takes an eternity, even if cached. Several to hundreds of clock cycles for each of the operands where in RISC if you are assuming a pure register based scheme which not all are, the lines are getting fuzzy. With CISC if microcoded it is going to registers anyway, then the operation is happening, if not microcoded then it still gets latched into internal temporary storage (registers) and then the operation can begin. With risc you have a couple-three extra, simpler, instructions the latching to registers takes the same amount of time as it does in CISC. Now if the algorithm never uses that result or does not use it for a while, it might be a win for CISC (if not microcoded) but if the value is an intermediate value in an algorithm then a clear win for RISC. Even if everything is cached it is a half a dozen to dozen clock cycles to get each parameter and write it back, any cache misses and it is an eternity. Same for RISC but with more registers, and significantly faster access to those registers, zero or one clock for each value and to store back, for some percentage if not the whole algorithm.
As with any benchmarking it is trivial to show a RISC winning case and to show a CISC winning case.
The major difference between RISC and CISC is CISC are complicated time consuming instructions where RISC they are much simpler, you arrange the tasks you need to do and have tighter control over those tasks, you dont have a lot of waste per step. One could argue caches were created to deal with the inefficiencies of CISC or at least one popular one. Both benefit sure, but one relies on the other doesnt as much. Trivial to show CISC winning code and trivial to show RISC winning code. Same goes for VLIW, and others.
RISC designs are simpler, smaller, pipes can be shorter, compiler has more control over the performance, etc. So with microcontrollers you can have a very nice processor core with a 3 stage pipeline that is really low power and still quite efficient. The 6502, z80, 8051, etc have really died off for the most part, you still do see a lot of 8051s if you are looking, the desktop/laptop you might be reading this with probably has one 8051, but that is due to royalties and not because of its size or performance, you probably have several to dozens of ARM cores for every x86, within the same box or certainly around the house. A CISC is going to be relatively massive and inefficient, it might be possible to get the power consumption down to RISC levels, that may just be a matter of design and not CISC vs RISC, but the RISC implementations are doing a much better job at watts per mhz than the CISC implementations.
Using registers can simplify the operand fetching logic of functional unis. With CISC functional units should be able to fetch data from memory. With RISC, all the functional units will operate on registers as it is guaranteed that the data will be there, so less complicated.
Also, think of a case where you have multiple MUL operations some uses data at location A, some use B, shown below.
'MUL A, B'
'MUL C, B'
When you perform the operation in CISC, you will be reading B, twice. But in RISC, you load it to a register once, and can use multiple times. So less memory (cache) accesses.
Also think of number of bits needed to represent that MUL in CISC. As A, B, C can be memory locations, they could be anywhere within your address spaces. On the other hand with registers in RISC, bits needed to represent your operands are less, hence less complicated instruction set.
As from above responses, we can conclude that the using registers instead of direct memory location gives the benefit in efficiency in terms of clock cycle and so the power consumption. They also give the benefit in term of complexity of instructions.

cuda unified memory: memory transfer behaviour

I am learning cuda, but currently don't access to a cuda device yet and am curious about some unified memory behaviour. As far as i understood, the unified memory functionality, transfers data from host to device on a need to know basis. So if the cpu calls some data 100 times, that is on the gpu, it transfers the data only during the first attempt and clears that memory space on the gpu. (is my interpretation correct so far?)
1 Assuming this, is there some behaviour that, if the programmatic structure meant to fit on the gpu is too large for the device memory, will the UM exchange some recently accessed data structures to make space for the next ones needed to complete to computation or does this still have to be achieved manually?
2 Additionally I would be grateful if you could clarify something else related to the memory transfer behaviour. It seems obvious that data would be transferred back on fro upon access of the actual data, but what about accessing the pointer? for example if I had 2 arrays of the same UM pointers (the data in the pointer is currently on the gpu and the following code is executed from the cpu) and were to slice the first array, maybe to delete an element, would the iterating step over the pointers being placed into a new array so access the data to do a cudamem transfer? surely not.
As far as i understood, the unified memory functionality, transfers data from host to device on a need to know basis. So if the cpu calls some data 100 times, that is on the gpu, it transfers the data only during the first attempt and clears that memory space on the gpu. (is my interpretation correct so far?)
The first part is correct: when the CPU tries to access a page that resides in device memory, it is transferred in main memory transparently. What happens to the page in device memory is probably an implementation detail, but I imagine it may not be cleared. After all, its contents only need to be refreshed if the CPU writes to the page and if it is accessed by the device again. Better ask someone from NVIDIA, I suppose.
Assuming this, is there some behaviour that, if the programmatic structure meant to fit on the gpu is too large for the device memory, will the UM exchange some recently accessed data structures to make space for the next ones needed to complete to computation or does this still have to be achieved manually?
Before CUDA 8, no, you could not allocate more (oversubscribe) than what could fit on the device. Since CUDA 8, it is possible: pages are faulted in and out of device memory (probably using an LRU policy, but I am not sure whether that is specified anywhere), which allows one to process datasets that would not otherwise fit on the device and require manual streaming.
It seems obvious that data would be transferred back on fro upon access of the actual data, but what about accessing the pointer?
It works exactly the same. It makes no difference whether you're dereferencing the pointer that was returned by cudaMalloc (or even malloc), or some pointer within that data. The driver handles it identically.

How to write with a single node in MPI

I want to implement some file io with the routines provided by MPI (in particular Open MPI).
Due to possible limitations of the environment, I wondered, if it is possible to limit the nodes, which are responsible for IO, so that all other nodes are required to perform a hidden mpi_send to this group of processes, to actually write the data. This would be nice in cases, where e.g. the master node is placed on a node with high-performance filesystem and the other nodes have only access to a low-performance filesystem, where the binaries are stored.
Actually, I already found some information, which might be helpful, but I couldn't find further information, how to actually implement these things:
1: There is an info key MPI_IO belonging to the communicator, which tells which ranks provide standard-conforming IO-routines. As this is listed as an environmental inquiry, I don't see, where I could modify this.
2: There is an info key io_nodes_list which seems to belong to file-related info-objects. Unfortunately, the possible values for this key are not documented and Open MPI doesn't seem to implement them in any way. Actually, I can't even get the filename from the info-object which is returned by mpi_file_get_info...
As a workaround, I could imagine two things: On the one hand, I could perform the IO with standard Fortran routines, or on the other hand, create a new communicator, which is responsible for IO. But in both cases, the processes, which are responsible for IO have to check for possible IO from the other processes to perform manual communication and file interaction.
Is there a nice and automatic way to restrict the IO to certain nodes? If yes, how could I implement this?
You explicitly asked about OpenMPI, but there are two MPI-IO implementations in OpenMPI. The old workhorse is ROMIO, the MPI-IO implementation shared among just about every MPI implementation. OpenMPI also has OMPIO, but I don't know a whole lot about tuning that one.
Next, if you want things to happen automatically for you, you'll have to use collective i/o. The independent I/O routines cannot send a message to anyone else -- they are independent and there's no way to know if the other side will be listening.
With those preliminaries out of the way...
You are asking about "i/o aggregaton". There is a bit of information here in the context of another optimization called "deferred open" (and which OMPIO calls Lazy Open)
https://press3.mcs.anl.gov/romio/2003/08/05/deferred-open/
In short, you can definitely say "only these N processes should do I/O", and then the collective I/O library will exchange data and make sure that happens. The optimization was developed some 15-odd years ago for just the situation you proposed: some nodes being better connected to storage than others (as was the case on the old ASCI Red machine, to give you a sense for how old this optimization is...)
I don't know where you got io_nodes_list. You probably want to use the MPI-IO info keys cb_config_list and cb_nodes
So, you've got a cluster with master1, master2, master3, and compute1, compute2, compute3 (or whatever the hostnames actually are). You can do something like this (in c, sorry. I'm not proficient in Fortran):
MPI_Info info;
MPI_File fh;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_config_list", "master1:1,master2:1,master3:1");
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE|MPI_MODE_WRONLY, info, &fh)
With these hints, MPI_File_write_all will aggregate all the I/O through the MPI processes on master1, master2, and master3. ROMIO won't blow up your memory because it will chunk up the I/O into a smaller working set (specified with the "cb_buffer_size" hint: cranking this up, if you have the memory, is a good way to get better performance).
There is a ton of information about the hints you can set in the ROMIO users guide:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide/node6.html

Moving data from memory to memory in micro controllers

Why can't we move data directly from a memory location into another memory location.
Pardon me if I am asking a dumb question, but I think this is a true situation, at least for the ones I've encountered (8085,8086 n 80386)
I am not really looking for a solution for moving the data (like for eg, using movs n all), but actually the reason for this anomaly.
What about MOVS? It moves a 8/16/32-bit value addressed by esi to the location addressed by edi.
The basic reason is that most instruction sets allow one register operand, and one memory operand, and sticking to this format makes designing the instruction decoder easier. It also makes the execution engine inside the CPU easier, because the instruction can issue typically a memory operation to just one memory location, and at most one register block read or write.
To do a memory-to-memory instruction directly requires two memory locations to be designated. This is awkward given a register/memory instruction format. Given the performance of the machines, there is little justification for modifying the instruction format just for this.
A hack used by more modern CPUs is to provide some type of block-move instruction, in which the source and destination locations are located in registers (for the X86 this is ESI and EDI respectively). Then an instruction can just designate two registers (or in the case of the x86, instructions that simply know which registers). That solves the instruction decoding problem.
The instruction execution problem is a little harder but people have lots of transistors. Organizing a read indirect from one register, and write indirect through another, and increment both is awkward in silicon but that just chews up some transistors.
Now you can have an instruction that moves from memory to memory, just as you asked.
One of the other posters noted for the X86 there are instrucitons (MOVB, MOVW, MOVS, ...) that do exactly this, one memory byte/word/... at a time.
Moving a block of memory would be ideal because the CPU can generate high-bandwith reads and writes. The x86 does this with with a REP (repeat) prefix on MOV- to move a larger block.
But if a single insturction can do this, you have the problem that it might take a long time to execute (how long to move 1Gb? --> millions of clock cycles!) and that ruins the interrupt response rate of the CPU.
The x86 solves this by allowing REP MOV- to be interrupted, with the PC being set back to the beginning of the instruction. By updating the registers during the move appropriately, you can interrupt and restart the REP MOV- instruction having both a fast block move and high interrupt response rates. More transistors down the tube.
The RISC guys figured out that all this complexity for a block move instruction was mostly not worth it. You can code a dumb loop (even the x86):
copy: MOV EAX,[ESI]
ADD ESI,4
MOV [EDI],EAX
ADD EDI,4
DEC ECX
JNE copy
which does the same basic thing as REP MOV- . Pretty much the modern CPUs (x86, others) execute this so fast (superscalar, etc.) that the bus is just as utilized as the custom move instruction, but now you don't need all those wasted transistors (or corresponding heat).
Most CPU varieties don't allow memory-to-memory moves. Normally the CPU can access only one memory location at at time, which means you need a temporary spot to store the value when moving it (a general purpose register, usually). If you think about it, moving directly from one memory location to another would require that the CPU be able to access two different spots in RAM simultaneously - that means two full memory controllers at least, and even then, the chances they'd "play nice" enough to access the same RAM would be pretty bad. The chip designers might have been able to pull some tricks to allow direct copies from one RAM chip to another, but that would be a pretty special-application kind of feature that would just add cost and complexity to solve a very uncommon problem.
You might be able to use some special DMA hardware to make it look to your program like memory is being moved without that temporary storage, at least from the perspective of your CPU.
You have one set of address lines, one set of data lines, and a few control lines between the CPU and RAM. You can't physically move directly from memory to memory without a second set of address lines and a whole bunch of complicated logic inside the RAM. Therefore, we have to store it temporarily in a register.
You could make an instruction that does the load and store together and looks like one instruction to the programmer, but there are other considerations like instruction size, non-duplication of effective address calculation logic, pipelining, etc. that make it desirable to keep it more simple.
Memory-memory machines turn out to be slower in general than load-store machines. This was deduced/figured out/invented by the RISC researchers in 1980ish or so. So the older architectures (VAX/OS360) tend to have memory-memory architectures; newer machines do load-store.
Another interesting variant is stack machines; they seem to always be around as a minority.

Resources