What happens before Micro-controller Startup Code being executed? or Power-On/Reset Sequence? - microcontroller

What I know is as below and correct me if wrong, For the automotive bootloader based on any microcontroller, we will have
Startup code (Flash)
Primary bootloader (Flash)
Secondary bootloader (RAM)
As far as a power-on sequence is considered I know that,
From the startup code (provided by the micro vendor, Freescale, ST
Micro, etc.,) the control will be transferred to PBL (Primary
bootloader) using jump or function pointer.
PBL will download the SBL (Secondary bootloader) into RAM, which will
contain the flash driver, capable to download the application.
SBL will download the application into the flash area.
But what will happen before startup code is being executed or just after power on?
I know that each controller will have some sort of code to execute after power on POST (power-on self-test) but still not clear with sequence to operation till bootloader execution comes into execution.
It would be a great help if someone can provide a sequence of operations to reach startup code?

I find it this not uncommon confusion interesting.
POST is software in general, but your question is so vague. Usually when someone talks about POST they are talking about their x86 based computer, that is just software, happens well after the part you are confused about, and is in no way whatsoever required for a computer/processor to run, it has a purpose, adds value so it is there.
Microcontrollers in general do not have primary bootloaders nor secondary, they simply start running your application. Of the dozens/hundreds I have used/examined trying to think of any that have a primary or secondary. Can't think of any off hand. Certain brands in particular do have bootloaders that are usually programmed by them and you cant change or some that you can. How you get into the bootloader varies by brand, often a strap, sometimes a non-volatile bit in a register.
First off processors and the chip around it are dumb, very dumb. Only do what they are told by the humans. Incredibly simple machines. And while the difference between an mcu and a full blown system are at this view pretty much identical, the mcus are simpler and more reliable (for various reasons). The root of the answer starts with the processor or processor core or core or whatever term
might help you. In an mcu this is just one lego block in the whole of the chip, not necessarily even the largest block in the chip. When you look at arm based chips like the stm32 and others with a cortex-m (or older ones with ARMV7TDMI) that lego block is purchased ip from arm, the rest of the chip is either other purchased ip from one or more vendors or in-house made logic. the sram certainly and the flash probably is ip that the chip vendor buys for the specific process on the specific foundry (just like other cell library items, simple gates like AND, OR, NOT and more complicated gates).
Whatever processor core this is, it has an architecture and instruction set. While we know some architectures are implemented using microcode, unlikely that the mcus are, makes no sense the more cisc like might, but the arms and mips and such definitely not. But for this understanding it doesn't matter being microcoded or not there are bit patterns that drive the processor, machine code. We have all heard that chips are made of transistors, and they are. The transistors are part of the simplicity, the basic ones AND, OR, NOT gates you can look up on Wikipedia. You can (inefficiently) build the rest out of those fundamental blocks. A particular instruction tickles the logic, the transistors in a certain way to cause a chain of events, ones and zeros in a specific sequence that do the thing you asked. Logic is not limited to implementing processor instructions, most logic is not part of the decoding and execution of a processor instruction, most if it are equally dumb items. An sram is a lot of packed in bits (four transistors wired up a certain way per bit) with an address and data bus, the logic of an sram lights up rows and columns of these bits when writing or reading. Then there is more logic in front of that sram that decodes an address bus, etc.
As mentioned in the other answer, when power comes up then reset is released, the flip flop based items in the chip which are the registers we read in the manual plus countless others that are behind the scenes are set to their reset value which is done by wiring of the transistors. A number of state machines start which are similar to programs, but are hardwired. wait for reset to go high, once reset goes high then if this input to the state machine is this and that input to the state machine is that then I can move to the next state. The rules to get from one state to the next are implemented in logic. A chip with memory and flash for example might do a bist on the ram first, likely not in an mcu, doesn't make sense, this is logic not software doing this, this is not the post you think of in your laptop/desktop/server. The flash or ram or adcs or other logic might require some number of clocks to settle their logic before reset is released (the reset on the edge of the chip is not necessarily hard wired to all items in the chip, usually it is gated, delayed, etc). So there is a power on state machine that manages this, when the chip is ready then the processor itself will be released, this can be a few or dozens of clock cycles later. The clock itself has to settle, and the logic is designed to wait for that.
When the processor is released from reset it again may have some number of clocks to settle things in its design, it will have a state machine or many that start up the various blocks, and then based on the architectural design of that processor it does one of two things, fetches its first instruction from a known address (address within the processors address space which isn't necessarily the address in the chips view), or it uses a vector table approach and it reads a value from a known address, and that value read is the address of the first instruction and it fetches that instruction. Up to the first fetch there is no software, it is logic.
Depending on how the chip vendor has designed the chip, how they have defined the address space, and understand that addressing within a chip or board design is not some flat universal thing, to the programmer it is, but in reality it isn't. There are many busses with addresses and those address spaces are specific to that portion of the design. When you see the stm32 or others with a bootloader and a strap (boot0/boot1 pin), the logic on the other end of the processor bus may see a fetch at the well known address (meaning both the folks that implement the logic and the folks that write software for the logic know that this is the specific address where things start and if you don't put stuff there it won't boot/work) but as mentioned the chip vendor can do whatever they want with that and often do. As a programmer this can be easily understood as logic isn't any more magical than software:
if strap == 0 return flash_bank_0[address&mask]
else return flash_bank_1[address&mask]
For a certain address range that is decoded in front of this code, but also both banks may be directly addressable:
if address[24]==0 return flash_bank_0[address&mask]
else return flash_bank_1[address&mask]
And this way you can have what you see in the stm32s, that both address 0x00000000 and 0x08000000 or in other vendors chips 0x00000000 and 0x01000000 for example map to the same (flash) memory.
The reason being is that the cortex-ms is vector based, there is a table of addresses that point you at code rather than just instructions at known addresses (like the full sized arms arm7, arm9, arm11, cortex-a). The way you use that is you set your address for reset in the table to be 0x08000000 based so when the processor reads at 0x000000xx it is told to fetch instructions from 0x0800xxxx and it does. When the strap is the other way it finds a different flash which may or may not have a fixed space it may only be visible from the if-then-else. (pretty easy to see with a cortex-m and an SWD debugger and software).
The stm32s will have logic that if the strap is set to run the user application will fetch my guess is four words, if the first one or a specific one is all ones or for some chips all zeros (very often flash/rom resets to ones, because there is a logic in version saving a transistor, so the bit is a zero, but we see it as a one, the bits are all inverted, but this is not a hard and fast rule, just very common) the logic/state machine will, for the stm32 realize there is no user application and will load the bootloader. Now it is very possible the design actually always boots the bootloader and there is software there that looks at the application flash, but I think myself and others on this site decided that is not the case, but none of us work there nor have the visibility into the design. In either case the processor then starts executing what it finds and it is very dumb it is told fetch from this address and it does, the programmer had to make sure that stuff is at that address, and each and every instruction has to be laid out in order properly like train tracks, any gaps or mistakes and the trail goes off the rails, otherwise the train is stupid it just follows the tracks. As humans we call the software post or bootloader or application or whatever. It is just software. Once the processor is started if some software loads and runs other software the processor doesn't know it is stupid it just keeps performing the instructions it is fed as it rolls down the track.
Short answer:
Power ramps up to a chip specified level. At a chip specified time reset should be released. This releases state machines to get the chip ready as needed and release the processor. The processor based on its design either fetches its first instruction from a known place or it reads from a known place and that user planted value is the address where the first instruction lives. After that per the architecture of the chip the execution of that first instruction and fetching of more based on that instruction continue until it crashes or is turned off or put in reset.
There is no magic.
There are a number of good open cores out there that you can simulate with free tools and see (with free tools) the internal signals that make that chip work, you can see the post reset activity leading up to the first fetch and then all the execution from there.

Without knowing which microcontroller you are using, this should be general enough:
The hardware in the microcontroller resets several registers to their documented values. This includes the PC, the program counter.
If the microcontroller has configurable reset vectors the value can be chosen from a few alternatives, other controllers always use the same value.
The code at the location the PC points to is the startup code.
Note: It's always a good idea to read the data sheet of the controller!

Related

What does CPU frequency represent in an Arduino board?

I'm new to Arduino and microcontrollers.
I was studying the specs and found that even the same board may have different frequencies with different input voltages (3.3V vs 5V). So the question is, what does frequency represent? Does it represent how many lines of assembly code it's able to run? Or the maximum PWM frequencies it's able to output?
A further question would be, if I'm looking for a board for a specific project, how do I decide which frequency I will need a priori, instead of trying everything out and see which one works?
What makes me more confused is that when it comes to computer CPUs, it seems that lower frequency CPUs can actually run faster than higher frequency ones (e.g. Intel). So how do I actually know how fast a microcontroller can run?
By frequency we mean the frequency of the CPU clock. Say your Arduino Uno runs on 16 MHz, which is 16,000,000 Hertz.
That means there are 16 Million clock cycles per second. The CPU executes the byte-code of the program. One Assembler instruction can actually take any number of CPU cycles to execute, usually between 1 and 4 cycles for simple stuff, and a little bit more heavy arithmetic and writing to memory. So it's a rough estimate of how many "lines of assembler" (that is, byte-code instructions) it can run per second. A measurement which is a little bit better is the "MIPS" value, the "Millions of instructions per second". There are other benchmark types for CPUs which are more accurate.
If you take at the datasheet for the AVR microprocessor architecture, you can see the cycles that each instructions needs: (link: http://www.atmel.com/images/Atmel-0856-AVR-Instruction-Set-Manual.pdf)
So for an ADD Rd, Rr instruction, an AVR CPU needs 1 clock cycle.
Take a desktop Intel CPU for example. It's common these days that they have a clock frequency of 2 GHz or more, which is 2 Billion cycles per second compared to the 16 million cycles per second on the Arduino's AVR CPU. So the Intel CPU beats the Arduino by far. Then again, the Arduino is designed for completley different stuff - it's a small microcontroller with low overhead, runs no OS etc. The use-case for such a CPU (and the architecture) is just different, which makes comparing them unjustified. There many other factors in play, like multi-core CPUs (4 Intel CPU cores vs. 1 AVR) and command pipelining, the speed of your memory / RAM, etc. It's really hard to compare a CPU to another one in every use-case possible, but for "general purpose computing", the Desktop CPUs (AMD, Intel, x64 architecture) far outruns the processing power of a mere Arduino AVR CPU.
I hope this clears up some confusion.
I think one confusion you have may be chip specific, I am not going to look it up right now but I do remember seeing this, the chip spec may say that for this input voltage range it can handle this frequency and for this voltage range it cannot. I think sparkfun the 3.3 are 8mhz and 5.0 are 16mhz or something like that. Anyway, that is not generally the case, but it is a chip by chip vendor by vendor thing and that is why you have to read the datasheet. Has nothing to do with arduinos or avrs specifically, just a general chip design thing.
How do you know how fast your microcontroller can run? That is a very loaded question, depends on your definition of fast. If it is simply what clock frequencies can I use, well "just read the datasheet" for that part, and then depending on your board design choose from what is available, if you do not have any external clocks then your choices may or may not be more limited, you may or may not have a pll that you can use to multiply the clock source.
if your definition of fast is how fast can I perform this task, how many whatevers per second or how much wall clock time does it take to complete some specific task. Well that is a benchmark problem and there are so many variables that there is actually no real answer. Yes it is very true that an x86 can have a lower clock and run faster than some other x86, historically the newer ones can do less stuff per clock than older ones for the same binaries, you have to then tune the compile to the newer chip and then you might get back some of your mips to mhz. but that is in part because you are using a different chip design that just speaks the same language (machine code). You can have a tall person that can recite a poem faster than a short person, both using english and the same poem, has nothing to do with them being short or tall, just that they are different humans.
There are different avr core variations but not remotely on par with the different x86 architectures. so while comparing a tiny vs an xmega you can probably have the xmega run "faster" at the same clock rate simply because it has more registers or a bigger address space, etc. But instructions per second is probably not really different, could be, but my guess is not so much.
Then there is the compiler, the compiler plays a huge role in how "fast" your code runs, change compilers or compiler versions or compiler settings and the machine code produced from the same high level source code (C for example) can vary greatly and as a result can have dramatic effects on the "speed" of the code. Take the dhrystone for example, very easy to demonstrate that the same exact source code on the same exact chip/board, same clock rates, etc can execute at vastly different speeds based on either using different compilers, versions or command line settings, kinda proving that the godfather of benchmarks is basically useless in providing any meaningful information.
Microcontrollers make the problem much worse as you often are running the program out of flash, and many, not all, but many have the ability to either divide or multiply or both the clock, but the flash is not always designed for the full range. You might have a chip that boots on an internal clock at 8mhz but you can use the pll to multiply that up to say 80mhz. But not uncommon that the flash is limited to say 16mhz on a chip like that so at 8mhz the flash can deliver an item say an instruction every cpu clock, but at 20mhz you have to put a wait state and although the cpu is running much faster you can only feed it at 16mhz so it is waiting around more, and then acts fast when it gets something, is it really "faster" or is clocking up making you slower. Certainly at just under 16mhz in this fantasy chip I am describing you can keep it to zero wait states so it is really faster, not necessarily twice as as there are other factors, but definitely faster than 8mhz. just at or above 16mhz though you take a huge performance hit compared to just under 16mhz. at just under 32mhz though it is pretty fast compared to just under, then at just over 32mhz another wait state setting and much slower again even though the clock is basically the same and so on.
Then there is the fetching, how does the cpu actually fetch, like an arm where it fetches a bunch of data per fetch transaction, even if it is not going to execute all of them if you branch to 0x1004 and at that address there is a branch to 0x2008 the core might fetch 0x10 bytes from 0x1000 to 0x100F, THEN extract the 0x1004 word/instruction, decode it to find it is a branch then read 0x10 bytes from 0x2000. basically reading 0x20 bytes to find 2 instructions. Take two instructions if both are in the 0x10 bytes then good if one is at 0x100C and the other at 0x2000 that is a performance hit. take this internal information and apply it to an application and all of its jumping around, changing one line of code or adding or removing a single nop to the bootstrap (causing the alignment of the program to change in the address space) can cause anywhere from a tiny to a large change in performance, swap two helper function sin the source code of your program, in the text, causing them to land in different address spaces once compiled, can have little to major performance effects without actually changing the functions themselves.
So performance is first of a foolish task to go after in one respect, in the other respect all that matters is your program as written with the compiler you are using on the hardware you are using, it is as fast as it is, and there are things you can do to make that code faster on that compiler on that target on that day, by changing compiler settings or the code or both. And ideally you build your final firmware, performance test that, and never build again as if you build a year or two later it may be on a different host compiler with a different compiler or compiler version and all bets are off on performance.
How do you pick a board, how much flash, ram, features, clock rate. A lot of it is experience by just trial and error, you fortunately live in a time where you can literally try hundreds of boards all of which cost anywhere from a few bucks to like 10 or 20 each, different cpu architectures different chip vendors, etc. there are many compiler choices and even languages available, basically there are too many easy to acquire choices, unlike back in the day when the parts were pretty cheap but you may have had to build your own board, write your code in asm, maybe even create your own assembler, etc. Have a rom programmer that cost hundreds to thousands of dollars. So go with the AVR you have and play with its features, play with the compiler and/or write or both. Do experiments to see if there are fetch effects or not. If you have clocking choices mess with those see what happens.
Of course all of this starts with reading the chip documentation from the vendor.

network-on-chip verilog code

I have written and simulated a Verilog code in ISE Project Navigator 2013. this is an RTL model that describes the network-on-chip routers, buffers and links.
which device is better for synthesis and implementation?
How can I get the static and dynamic power consumption, a packet transfer delay, area and the other factors that indicates network performance, using ISE Project Navigator?
The question is very open ended so I will try to provide as general an answer as possible.
Now you have said that you have the code for a NOC Router in ISE. This would imply that you or the designer has a rough idea of the frequency at which the internal logic/system has to operate. The maximum clock tree frequency of your target device and would then be one of the key parameters that you need to check. If your design is running at around 150-200 MHz and is appropriately pipelined (small muxes, not more than 2-3 levels of logic between pipelining stages), then almost any of the currently available device families from both Xilinx and Altera should be suitable.
The next important consideration is of external connectivity. Does your design need high-speed serial connectivity with an external device. If that is true, then you would need to select a device that has high-speed SERDES IPs in-built. That would then limit your choice of devices.
Another factor to consider is interface to external SDRAM or RLDRAM. If your design needs to interface with such external devices, then you need to pick a device that has support either through a softcore or a Megafunction (Altera) or a hard IP block.
Finally you need to look at your logic utilization. You want to chose a device that is just big enough to satisfy your requirements, unless your design is part of a bigger project and there are modules that would be designed later and would sit alongside your NOC. You would have to make a rough guess of the number of LEs/LUTs that your design would need and pick a device 50% bigger than that. You can then run a trial synthesis run and check if your estimates are okay. If they are, and your device is less than 50% utilized, you could go in to a smaller device as need be.
There are also a few other considerations such as number of IOs, presence of a PLL/Clock manager that may affect your choice of device

Oscilloscope type design with FPGA PL and PS framebuffer interface?

I am generating a certain signal (digital pulse) in one of my verilog module running on programmable logic in Xilinx Zynq chip. Signal is pretty fast, with clock of about 200MHz.
I also have a simple linux and framebuffer Qt interface running for later controlling my application.
How can I sample my signal in order to make oscilloscope like interface inside my Qt app? I want to be able to provide visual of the pulse I am generating.
What do I need to use to be able to sample enough data at such clock frequency? And how do I pass it with kernel module or mmap to Qt?
You would do best to do what most oscilloscopes do: sample the data to RAM, and only then transfer it to the processor for display/analaysis, at a more "relaxed" pace.
On the FPGA side you will need a state machine that detects some sort of start or trigger condition, probably after a bit in a mode register is set from the software side to arm it.
The state machine will then fill samples into a buffer made of one or more block rams. If you want to placing the trigger somewhere in the middle of the samples captured, you should it as a circular buffer, and have it record continuously, stopping configurable number of samples after the trigger, so that some desired number from before the trigger condition remain un-overwritten by newer ones following it.
Since FPGA block rams are typically dual port, you can simply hook the other port up to your CPU bus for readout. You will probably want a register to read the state of the sampling state machine, and if you go with the circular buffer approach, the address where it stopped, so that you can unwrap the data to a linear record of time.
Trying to do streaming realtime sampling might be possible, but would be a lot harder and it is not clear that you could do anything meaningful with the data so produced in real time. Still, if you want to try you would probably need to put a FIFO buffer in between the sampling and the processor bus, as you will probably only be able to consume data in chunks, while having to service other operations in between, so something is needed to absorb the constant-rate inflow of samples. Another approach could be to try to build a DMA engine which would write samples directly to external system ram, but that will likely be even harder.
You could also see if there are any high speed interfaces available in the CPU which you could leverage - they might be things originally intended for video, for example.
It also appears that you are measuring only a digital signal, ie, probalby one bit. If you want to handle a higher input sample rate than the FPGA fabric can support, that could mean you could potentially use something like a deserializer block at the edge of the FPGA to turn the 1-bit input stream into a slower stream of wider samples to store.
In terms of output, once you have a vector of samples in a buffer it's pretty simple to turn that into a scope/logic analyzer type plot, with as much zooming, cursor annotation, automatic measurement or whatever you like.
Also don't forget that if the intent is only to use this during development, FPGAs and their tools often have the ability to build a logic analyzer right into the design, with the data claimed over the programming interface for plotting on a PC.

Serial Comms baud rate, parity and stop bits. Which options to use and when?

I'm trying to pick up some serial comms for a new job I am starting. I have done some reading which has helped a lot however, a lot of the reading tells you about the specification of serial comms and what everything is, but not when is best to use particular options.
My searches for this information so far only seem to pull in the spec; perhaps as a novice I am searching for the wrong terms.
My questions then!
Baud Rate - I have read this is signal changes per second and is often mislabelled as bits per second. Is this essentially bits per second including the frame data if asynchronous, and actually bits per second if synchronous?
Parity - Even/Odd.. Is there any difference at all between the two? I'm thinking in terms of efficiency or similar. Does this only still exist for compatabilities sake?
Stop Bits - I have read so far you can have 1 or 2 stop bits. In C# there seems to be an option for 1.5 too. I can't find anything on why you would want/need more than 1.
If anyone can advise on these points, or point me to some recommended reading material I would be very grateful.
Thanks for reading.
edit: typo
You very rarely have a choice, you must make it compatible with the settings that the device uses. If you don't know then you need to look in a manual or pick up a phone. Do keep in mind that it is increasing very rare to work with a real serial port device, one that uses an UART. Most commonly you actually talk to an emulated serial port, implemented by a USB or Bluetooth device driver. The settings you use don't matter in such a case since the actual signaling is implemented by the underlying bus.
If you can configure the device then basic guidelines are:
Baudrate is directly related to the length of the cable and the amount of electrical interference that's present. You have to go slower when you get bit errors. The RS-232 spec only allows for a maximum of 50 ft at 9600 baud.
Parity ought to be used when you don't use an error-correcting protocol. It does not matter whether you pick Odd or Even. Odd people pick odd, it's their prerogative.
Stopbits is usually 1. Picking 1.5 or 2 help a bit to relieve pressure on a device whose interrupt response times are poor, detected by data loss.
Databits is almost always 8, sometimes 7 if the device only handles ASCII codes.
Handshaking is an important setting that never stop causing trouble since many programmers just overlook it. Modern computers are almost always fast enough to not need it but that's not necessarily true for devices. The most basic stay-out-of-trouble configuration is to turn DTR on when you open the port and to tell the device driver to take care of RTS/CTS handshaking. Xon/Xoff handshaking is sometimes used, depends on the device.
A good 90% of the battle is won by implementing solid error checking. It is almost always skimped on, bad idea. Very important for serial port devices since they have no error correcting capabilities themselves and very weak error detection. Always make sure that you can detect and properly report overrun, parity and framing errors. And test them by getting the settings intentionally wrong.

How to know whether Power on reset or Software reset has occured in 8051 microcontrller

I am developing an application on ATMEL AT89C51 of 8051 family.
Could anyone suggest how to determine in coding whether the reset has been done due to power cycle or through software?
According to the Atmel 8051 Microcontrollers Hardware Manual (PDF link), the power-off flag (POF / bit 4) in the power control register (PCON / 87h) is set by hardware when VCC rises from 0 to its nominal voltage. The power-off flag reset value will be 1 only after a power on (cold reset). A warm reset (e.g. software reset) does not affect the value of this bit.
I've often found that different vendors implement their own registers in the SFR space that can be taken advantage of for cases such as this. For example, Silicon Labs uses a power-on reset flag (PORSF) in their reset source register (RSTSRC).
It really depends if you wanted to depend on some specific 8051 variant vendor. It is best to use vendor provided registers, but if you changed vendor your code will brake, or even worse, it will misbehave.
If you had external RAM in your system (and it was not battery powered), than you could write a sequence of bytes (like 0xAA, 0x55...) somewhere in the reserved part of the memory, and check if it was still there after start up. If not, you have had a cold start. Of course, you should modify assembler start up code to make sure it does not initialize this part of memory (or it would be zero at each start), and you should instruct your linker to exclude this memory from linkage so that it does not get used by anything else.
Finally, include conditional compilation in your code so that if you had some 8051 variant with special registers, it would be used, if not, try the plan B.
I have done that with few bytes of internal 8051 memory /all my external RAM was battery powered/ and then I have learned than not every 8051 variant has had consistent policy at start up - some have all their internal memory initialized, some have initialized only SFR and some other specific areas leaving me few bytes to play with the procedure described.
I don't think there is a method to determine how reset has occurred because once reset everything starts from the beginning in 8051.
One method i guess would work is,
Say take a variable X, before every software code of reset, just set X=1 (indicating software reset) and store this variable in any ROM if you interfaced externally.
On every reset, at the beginning include an instance which checks this variable X to see which reset had occurred and change X to 0, for next time detection.
If you do not have an external ROM, interface an D-latch atleast.
I hope this works. Do tell me if this works.

Resources