Size of intel x86 Segment registers and GDT(LDT) Register - cpu-registers

I'm a beginner-level student of system architecture, to be precise Intel x86.
Currently I'm reading Intel's manuals (1, 3a, 3b, 3c) and I'm stuck in the segmentation part.
As far as I know, in protected mode the system translates a logical address to a linear (or physical) address,
and a "far pointer" points to an actual linear (or physical) memory address with 2 different parts,
a segment selector and an offset.
As I learned at university, each segment register holds 16 bits of data.
According to Intel's manual, those 16 bits are only the visible part of the segment register,
but there is also a hidden part of the segment register which cannot be programmed or accessed by the user.
Is there any chance that I could know the actual size of a segment register?
My second question is about the LDT, GDT and IDT registers in protected mode.
Are those registers (LDTR, GDTR, IDTR) actual registers on the CPU chip?
If they are, is there any chance to access those tables after the boot sequence (privilege ring 3, user mode)?
Thank you for reading my question.
PS. I tried to google it and couldn't find any answer.
That's why I'm spending my time writing this question.

The segment registers are 16 bits. The segment descriptors that the segment registers refer to are larger. The confusing thing is that all i386 and later processors have a small non-coherent cache of segment descriptors that correspond to the segment registers (one cached descriptor for each segment register) that is sometimes referred to as the hidden part of the segment register. It's not really part of the register, though each entry in the cache is closely associated with a specific segment register. The cache is tightly tied to the segment registers, in that whenever a segment register is written to, the corresponding cache element is updated (re-read from memory), and instructions that use a segment register use the cached descriptor corresponding to that register rather than reading the descriptor from memory.
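As a small illustration of the visible part: reading a segment register from user code only ever gives you the 16-bit selector, never the cached descriptor. A minimal sketch, assuming GCC inline assembly on a Linux x86/x86-64 system:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t cs, ds, ss;
    /* mov from a segment register transfers only the 16-bit visible part
       (the selector); the hidden descriptor cache is not reachable here */
    __asm__ ("mov %%cs, %0" : "=r"(cs));
    __asm__ ("mov %%ds, %0" : "=r"(ds));
    __asm__ ("mov %%ss, %0" : "=r"(ss));
    printf("cs=%#06x ds=%#06x ss=%#06x\n", cs, ds, ss);
    return 0;
}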

x86 segment registers are all 16-bit. I'm not aware of any "hidden" segment register portion. If you were unable to find anything on this via a Google search, I doubt that it exists.
There is a good description of the Local Descriptor Table (LDT), the Global Descriptor Table (GDT), and the Interrupt Descriptor Table (IDT) on Wikipedia: http://en.wikipedia.org/wiki/Intel_8086.

Related

What happens before Micro-controller Startup Code being executed? or Power-On/Reset Sequence?

What I know is as below, and correct me if I'm wrong. For an automotive bootloader based on any microcontroller, we will have:
Startup code (Flash)
Primary bootloader (Flash)
Secondary bootloader (RAM)
As far as the power-on sequence is concerned, I know that:
From the startup code (provided by the micro vendor: Freescale, ST Micro, etc.) control is transferred to the PBL (primary bootloader) using a jump or a function pointer.
The PBL downloads the SBL (secondary bootloader) into RAM; the SBL contains the flash driver and is capable of downloading the application.
The SBL downloads the application into the flash area.
But what happens before the startup code is executed, or just after power-on?
I know that each controller has some sort of code to execute after power-on (POST, power-on self-test), but I'm still not clear on the sequence of operations until the bootloader comes into execution.
It would be a great help if someone could provide the sequence of operations to reach the startup code.
I find this not-uncommon confusion interesting.
POST is, in general, software, but your question is quite vague. Usually when someone talks about POST they are talking about their x86-based computer; that POST is just software, happens well after the part you are confused about, and is in no way required for a computer/processor to run. It has a purpose and adds value, so it is there.
Microcontrollers in general do not have primary or secondary bootloaders; they simply start running your application. Of the dozens/hundreds I have used/examined, I am trying to think of any that have a primary or secondary bootloader and can't think of any offhand. Certain brands in particular do have bootloaders, usually programmed by the vendor, some that you can't change and some that you can. How you get into the bootloader varies by brand: often a strap, sometimes a non-volatile bit in a register.
First off, processors and the chips around them are dumb, very dumb. They only do what they are told by humans; they are incredibly simple machines. And while an MCU and a full-blown system are, at this level of view, pretty much identical, MCUs are simpler and more reliable (for various reasons). The root of the answer starts with the processor, or processor core, or core, or whatever term helps you. In an MCU this is just one Lego block in the whole of the chip, not necessarily even the largest block in the chip. When you look at ARM-based chips like the STM32 and others with a Cortex-M (or older ones with an ARM7TDMI), that Lego block is purchased IP from ARM; the rest of the chip is either other purchased IP from one or more vendors or in-house logic. The SRAM certainly, and the flash probably, is IP that the chip vendor buys for the specific process at the specific foundry (just like other cell-library items: simple gates like AND, OR, NOT and more complicated ones).
Whatever processor core this is, it has an architecture and an instruction set. While we know some architectures are implemented using microcode, it is unlikely that MCUs are; it makes no sense for them. The more CISC-like parts might be, but the ARMs and MIPS and such are definitely not. For this understanding, though, it doesn't matter whether it is microcoded or not: there are bit patterns that drive the processor, machine code. We have all heard that chips are made of transistors, and they are. The transistors are part of the simplicity: the basic AND, OR, NOT gates you can look up on Wikipedia, and you can (inefficiently) build the rest out of those fundamental blocks. A particular instruction tickles the logic, the transistors, in a certain way to cause a chain of events, ones and zeros in a specific sequence that do the thing you asked. Logic is not limited to implementing processor instructions; most logic is not part of the decoding and execution of a processor instruction, and most of it is made of equally dumb items. An SRAM is a lot of packed-in bits (four transistors wired up a certain way per bit) with an address and data bus; the logic of an SRAM lights up rows and columns of these bits when writing or reading. Then there is more logic in front of that SRAM that decodes the address bus, and so on.
As mentioned in the other answer, when power comes up, reset is then released. The flip-flop based items in the chip, which are the registers we read about in the manual plus countless others behind the scenes, are set to their reset values, which is done by the wiring of the transistors. A number of state machines start, which are similar to programs but are hardwired: wait for reset to go high; once reset goes high, then if this input to the state machine is this and that input to the state machine is that, move to the next state. The rules to get from one state to the next are implemented in logic. A chip with memory and flash, for example, might do a BIST on the RAM first (likely not in an MCU, where it doesn't make sense); this is logic, not software, doing this, and it is not the POST you think of in your laptop/desktop/server. The flash or RAM or ADCs or other logic might require some number of clocks to settle before reset is released (the reset on the edge of the chip is not necessarily hardwired to all items in the chip; usually it is gated, delayed, etc.). So there is a power-on state machine that manages this; when the chip is ready, the processor itself is released, which can be a few or dozens of clock cycles later. The clock itself has to settle, and the logic is designed to wait for that.
When the processor is released from reset, it again may take some number of clocks to settle things in its design; it will have a state machine, or many, that start up the various blocks. Then, based on the architectural design of that processor, it does one of two things: it fetches its first instruction from a known address (an address within the processor's address space, which isn't necessarily the same address in the chip's view), or it uses a vector-table approach and reads a value from a known address, and that value is the address of the first instruction, which it then fetches. Up to the first fetch there is no software; it is all logic.
What happens next depends on how the chip vendor has designed the chip and how they have defined the address space; understand that addressing within a chip or board design is not some flat universal thing. To the programmer it is, but in reality it isn't: there are many buses with addresses, and those address spaces are specific to that portion of the design. When you see the STM32 or others with a bootloader and a strap (BOOT0/BOOT1 pins), the logic on the other end of the processor bus may see a fetch at the well-known address (meaning both the folks that implement the logic and the folks that write software for it know that this is the specific address where things start, and if you don't put stuff there it won't boot/work), but as mentioned, the chip vendor can do whatever they want with that, and often does. As a programmer this can be easily understood, since logic isn't any more magical than software:
if (strap == 0) return flash_bank_0[address & mask];
else            return flash_bank_1[address & mask];
That applies to a certain address range that is decoded in front of this logic, but both banks may also be directly addressable:
if ((address & (1 << 24)) == 0) return flash_bank_0[address & mask];
else                            return flash_bank_1[address & mask];
And this way you can have what you see in the STM32s: both address 0x00000000 and 0x08000000 (or in other vendors' chips 0x00000000 and 0x01000000, for example) map to the same (flash) memory.
The reason is that the Cortex-M is vector based: there is a table of addresses that point you at code, rather than just instructions at known addresses (as in the full-sized ARMs: ARM7, ARM9, ARM11, Cortex-A). The way you use that is to set your reset address in the table to be 0x08000000-based, so when the processor reads at 0x000000xx it is told to fetch instructions from 0x0800xxxx, and it does. When the strap is the other way, it finds a different flash, which may or may not have a fixed address space; it may only be visible through the if-then-else above. (This is pretty easy to see with a Cortex-M and an SWD debugger and software.)
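To make the vector-table idea concrete, here is a minimal sketch of what sits at that well-known address on a Cortex-M part. This assumes GCC, a linker script that places the .isr_vector section at the start of flash, and a linker-provided _estack symbol; the names are illustrative, not taken from any particular vendor's startup files:

/* word 0 is the initial stack pointer, word 1 the reset handler address;
   the core reads these two words from the well-known address before any
   software has run */
extern unsigned int _estack;                 /* top of stack, from the linker script */

void Reset_Handler(void);                    /* the "startup code" entry point */

__attribute__((section(".isr_vector"), used))
void (* const vector_table[])(void) = {
    (void (*)(void))&_estack,                /* word 0: initial SP */
    Reset_Handler,                           /* word 1: address of the first instruction */
};

void Reset_Handler(void)
{
    /* typically: copy .data from flash to RAM, zero .bss, then call main() */
    for (;;) { }
}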
The STM32s will have logic such that, if the strap is set to run the user application, it fetches (my guess) four words; if the first one, or a specific one, is all ones (or for some chips all zeros; very often flash/ROM resets to all ones because there is a logic inversion that saves a transistor, so the stored bit is a zero but we see it as a one, the bits being inverted, though this is not a hard and fast rule, just very common), then the logic/state machine will, for the STM32, decide there is no user application and will load the bootloader. Now it is very possible the design actually always boots the bootloader and there is software there that looks at the application flash, but I think myself and others on this site decided that is not the case; none of us work there nor have visibility into the design.
In either case the processor then starts executing what it finds, and it is very dumb: it is told to fetch from this address and it does. The programmer had to make sure that stuff is at that address, and each and every instruction has to be laid out in order properly, like train tracks; any gaps or mistakes and the train goes off the rails. Otherwise the train is stupid and just follows the tracks. As humans we call the software POST or bootloader or application or whatever; it is just software. Once the processor is started, if some software loads and runs other software, the processor doesn't know; it is stupid and just keeps performing the instructions it is fed as it rolls down the track.
Short answer:
Power ramps up to a chip-specified level. At a chip-specified time, reset should be released. This releases state machines that get the chip ready as needed and then release the processor. The processor, based on its design, either fetches its first instruction from a known place, or reads from a known place a user-planted value that is the address where the first instruction lives. After that, per the architecture of the chip, the execution of that first instruction, and the fetching of more based on it, continues until the chip crashes or is turned off or put back in reset.
There is no magic.
There are a number of good open cores out there that you can simulate with free tools and watch the internal signals that make the chip work; you can see the post-reset activity leading up to the first fetch and then all of the execution from there.
Without knowing which microcontroller you are using, this should be general enough:
The hardware in the microcontroller resets several registers to their documented values. This includes the PC, the program counter.
If the microcontroller has configurable reset vectors, the value can be chosen from a few alternatives; other controllers always use the same value.
The code at the location the PC points to is the startup code.
Note: It's always a good idea to read the data sheet of the controller!

Why Motorola 68k's 32-bit general-purpose registers are divided into data registers and address registers?

The 68k registers are divided into two groups of eight: eight data registers (D0 to D7) and eight address registers (A0 to A7). What is the purpose of this separation? Would it not be better if they were unified?
The short answer is that this separation comes from the architecture limitations and design decisions made at the time.
The long answer:
The M68K implements quite a lot of addressing modes (especially when compared with RISC-based processors), with many of its instructions supporting most (if not all) of them. This gives a large variety of addressing-mode combinations within every instruction.
This also adds complexity in terms of opcode execution. Take the following example:
move.l $10(pc), -$20(a0,d0.l)
The instruction just copies a long-word from one location to another, simple enough. But in order to actually perform the operation, the processor needs to figure out the actual (raw) memory addresses to work with for both the source and destination operands. This process, in which the operands' addressing modes are decoded (resolved), is called effective address calculation.
For this example:
In order to calculate the source effective address, $10(pc), the processor loads the value of the PC (program counter) register and adds $10 to it.
In order to calculate the destination effective address, -$20(a0,d0.l), the processor loads the value of the A0 register, adds the value of the D0 register to it, then subtracts $20.
This is quite a lot of calculations of a single opcode, isn't it?
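Written out as plain arithmetic, the two calculations look like this (an illustrative C model, not how the silicon actually does it; pc, a0 and d0 stand for the current register contents):

#include <stdint.h>

static void effective_addresses(uint32_t pc, uint32_t a0, uint32_t d0,
                                uint32_t *src_ea, uint32_t *dst_ea)
{
    *src_ea = pc + 0x10;          /* $10(pc): PC plus displacement       */
    *dst_ea = a0 + d0 - 0x20;     /* -$20(a0,d0.l): A0 + D0.L minus $20  */
}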
But the M68K is quite fast in performing these calculations. In order to calculate effective addresses quickly, it implements a dedicated Address Unit (AU).
As a general rule, operations on data registers are handled by the ALU (Arithmetic Logical Unit) and operations involving address calculations are handled by the AU (Address Unit).
The AU is well optimized for 32-bit address operations: it performs a 32-bit subtraction/addition within one bus cycle (4 CPU ticks), which the ALU doesn't (it takes 2 bus cycles for 32-bit operations).
However, the AU is limited to just load and basic addition/subtraction operations (as dictated by the addressing modes), and it's not connected to the CCR (Conditional Codes Register), which is why operations on address registers never update flags.
That said, the AU was meant to optimize the calculation of complex addressing modes, but it just couldn't replace the ALU completely (after all, there were only about 68K transistors in the M68K); hence there are two register sets (data and address registers), each with its own dedicated unit.
This is just based on a quick lookup, but a single set of 16 general-purpose registers would obviously be easier to program with. The problem is that every register field in every instruction would then need an extra bit to address 16 registers instead of 8, which would double the opcode space used per register operand. Splitting the registers between the two purposes is not ideal, but gives access to more registers overall.

cuda unified memory: memory transfer behaviour

I am learning CUDA but don't have access to a CUDA device yet, and I am curious about some unified memory behaviour. As far as I understand, the unified memory functionality transfers data between host and device on a need-to-know basis. So if the CPU accesses some data 100 times that is on the GPU, it transfers the data only during the first attempt and clears that memory space on the GPU. (Is my interpretation correct so far?)
1. Assuming this, is there some behaviour whereby, if the data structures meant to fit on the GPU are too large for the device memory, UM will swap out some recently accessed data structures to make space for the next ones needed to complete the computation, or does this still have to be done manually?
2. Additionally, I would be grateful if you could clarify something else related to the memory transfer behaviour. It seems obvious that data would be transferred back and forth upon access of the actual data, but what about accessing the pointer? For example, if I had 2 arrays of the same UM pointers (the data the pointers refer to is currently on the GPU and the following code is executed from the CPU) and I were to slice the first array, maybe to delete an element, would iterating over the pointers and placing them into a new array count as accessing the data and trigger a transfer? Surely not.
As far as I understand, the unified memory functionality transfers data between host and device on a need-to-know basis. So if the CPU accesses some data 100 times that is on the GPU, it transfers the data only during the first attempt and clears that memory space on the GPU. (Is my interpretation correct so far?)
The first part is correct: when the CPU tries to access a page that resides in device memory, it is transferred into main memory transparently. What happens to the page in device memory is probably an implementation detail, but I imagine it may not be cleared. After all, its contents only need to be refreshed if the CPU writes to the page and it is then accessed by the device again. Better ask someone from NVIDIA, I suppose.
Assuming this, is there some behaviour whereby, if the data structures meant to fit on the GPU are too large for the device memory, UM will swap out some recently accessed data structures to make space for the next ones needed to complete the computation, or does this still have to be done manually?
Before CUDA 8: no, you could not allocate more than what could fit on the device (oversubscribe). Since CUDA 8, it is possible: pages are faulted in and out of device memory (probably using an LRU policy, but I am not sure whether that is specified anywhere), which allows processing datasets that would not otherwise fit on the device and would require manual streaming.
It seems obvious that data would be transferred back and forth upon access of the actual data, but what about accessing the pointer?
It works exactly the same. It makes no difference whether you're dereferencing the pointer that was returned by cudaMallocManaged (or even malloc), or some pointer within that data. The driver handles it identically.
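As a minimal sketch of this behaviour (assuming CUDA 6 or later and a supported GPU; with CUDA 8+ on a Pascal-or-newer device the same allocation could even be larger than device memory, with pages migrating on demand):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= f;
}

int main()
{
    const int n = 1 << 20;
    float *x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));    // one pointer, visible to CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0f;     // CPU touch: pages live in host memory

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f); // GPU touch: pages migrate to the device
    cudaDeviceSynchronize();                     // required before the CPU reads again

    printf("x[0] = %f\n", x[0]);                 // CPU touch: pages migrate back on demand
    cudaFree(x);
    return 0;
}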

How to know whether a power-on reset or a software reset has occurred in an 8051 microcontroller

I am developing an application on an ATMEL AT89C51, of the 8051 family.
Could anyone suggest how to determine in code whether the reset was due to a power cycle or to software?
According to the Atmel 8051 Microcontrollers Hardware Manual (PDF link), the power-off flag (POF / bit 4) in the power control register (PCON / 87h) is set by hardware when VCC rises from 0 to its nominal voltage. The power-off flag reset value will be 1 only after a power on (cold reset). A warm reset (e.g. software reset) does not affect the value of this bit.
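As a small sketch of how that flag is typically used (assuming a Keil/SDCC-style header that declares PCON; POF is PCON bit 4, i.e. mask 0x10, per the manual quoted above):

#include <reg51.h>              /* Keil; use <8051.h> with SDCC */

#define POF_MASK 0x10           /* power-off flag, PCON.4 */

unsigned char reset_was_cold(void)
{
    if (PCON & POF_MASK) {      /* set only after VCC rose from 0 */
        PCON &= ~POF_MASK;      /* clear it so the next reset can be classified */
        return 1;               /* cold (power-on) reset */
    }
    return 0;                   /* warm reset, e.g. a software reset */
}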
I've often found that different vendors implement their own registers in the SFR space that can be taken advantage of for cases such as this. For example, Silicon Labs uses a power-on reset flag (PORSF) in their reset source register (RSTSRC).
It really depends on whether you want to tie yourself to a specific 8051 variant/vendor. It is best to use vendor-provided registers, but if you change vendor your code will break, or even worse, it will misbehave.
If you have external RAM in your system (and it is not battery powered), then you can write a sequence of bytes (like 0xAA, 0x55...) somewhere in a reserved part of that memory and check whether it is still there after start-up. If it is not, you have had a cold start. Of course, you should modify the assembler start-up code to make sure it does not initialize this part of memory (or it would be zeroed at each start), and you should instruct your linker to exclude this memory from linkage so that it does not get used by anything else.
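A minimal sketch of that signature approach, assuming SDCC-style syntax and a hypothetical reserved external-RAM address (0x7FFE is illustrative; it must match whatever your start-up code and linker leave untouched):

#define BOOT_SIG_0 0xAA
#define BOOT_SIG_1 0x55

/* two bytes in external RAM, excluded from start-up initialization and linkage */
__xdata __at(0x7FFE) volatile unsigned char boot_sig[2];

unsigned char cold_start_detected(void)
{
    if (boot_sig[0] == BOOT_SIG_0 && boot_sig[1] == BOOT_SIG_1)
        return 0;                     /* signature survived: warm reset */
    boot_sig[0] = BOOT_SIG_0;         /* plant the signature for next time */
    boot_sig[1] = BOOT_SIG_1;
    return 1;                         /* signature gone: cold start */
}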
Finally, include conditional compilation in your code so that if you have an 8051 variant with special registers, those are used; if not, fall back to plan B.
I have done this with a few bytes of internal 8051 memory (all my external RAM was battery powered), and then I learned that not every 8051 variant has a consistent start-up policy: some have all their internal memory initialized, some initialize only the SFRs and certain other specific areas, leaving a few bytes to play with for the procedure described.
I don't think there is a built-in method to determine how the reset occurred, because after a reset everything in the 8051 starts from the beginning.
One method I guess would work is:
Take a variable X; before every software-triggered reset in your code, set X = 1 (indicating a software reset) and store this variable in external non-volatile memory, if you have interfaced any.
On every reset, at the beginning, check this variable X to see which reset occurred, and change X back to 0 for the next detection.
If you do not have external non-volatile memory, interface at least a D-latch.
I hope this works. Do tell me if it does.

Why is the GDT not considered a segment?

Besides code, data, and stack segments that make up the execution environment of a program or procedure, the architecture defines two system segments: the task-state segment (TSS) and the LDT.
The GDT is not considered a segment because it is not accessed by means of a segment selector and segment descriptor. TSSs and LDTs have segment descriptors defined for them.
-- Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, Section 2.1.2
I'm a little confused about why the GDT is not considered a segment.
Can anyone give a detailed explanation?
The GDT, being the first point of lookup for everything, cannot be accessed via a selector because that would give you a chicken-and-egg situation.
The GDT itself has descriptors for the various TSS and LDT memory blocks, so they are considered to be segments, accessible via segment selectors. In addition, individual LDTs have selectors for other memory regions so those regions are segments as well.
But you just have to ask yourself: in which table would you look up the descriptor to locate the GDT, when the GDT itself is the first point of entry to the selection process?
In actual fact, when you load up the address of the GDT (with the LGDT instruction), it's a linear address you're using rather than a selector. From the x86 developer guide:
The LGDT and LIDT instructions are used only in operating-system software; they are not used in application programs. They are the only instructions that directly load a linear address (that is, not a segment-relative address) and a limit in protected mode. They are commonly executed in real-address mode to allow processor initialization prior to switching to protected mode.
That's why they say the GDT isn't considered a segment.
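To make the "linear address, not a selector" point concrete, here is a minimal ring-0 sketch of loading the GDTR (assuming a 32-bit freestanding environment and GCC inline assembly; the GDT contents themselves are omitted):

#include <stdint.h>

/* the 6-byte pseudo-descriptor LGDT reads: a 16-bit limit and a 32-bit linear base */
struct gdt_ptr {
    uint16_t limit;     /* size of the GDT in bytes, minus 1 */
    uint32_t base;      /* linear address of the GDT itself  */
} __attribute__((packed));

static uint64_t gdt[3];     /* null descriptor plus, e.g., code and data descriptors */

static void load_gdt(void)
{
    struct gdt_ptr gp = { sizeof(gdt) - 1, (uint32_t)(uintptr_t)gdt };
    __asm__ volatile ("lgdt %0" : : "m"(gp));   /* takes a linear address, no selector involved */
}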

Resources