On what parameters boot sequence varies? - unix

Does every Unix flavor have same boot sequence code ? I mean there are different kernel version releases going on for different flavors, so is there possibility of different code for boot sequence once kernel is loaded? Or they keep their boot sequence (or code) common always?
Edit: I want to know into detail how boot process is done.
Where does MBR finds a GRUB? How this information is stored? Is it by default hard-coded?
Is there any block level partion architecture available for boot sequence?
How GRUB locates the kernel image? Is it common space, where kernel image is stored?
I searched a lot on web; but it shows common architecture BIOS -> MBR -> GRUB -> Kernel -> Init.
I want to know details of everything. What should I do to know this all? Is there any way I could debug boot process?
Thanks in advance!

First of all, the boot process is extremely platform and kernel dependent.
The point is normally getting the kernel image loaded somewhere in memory and run it, but details may differ:
where do I get the kernel image? (file on a partition? fixed offset on the device? should I just map a device in memory?)
what should be loaded? (only a "core" image? also a ramdisk with additional data?)
where should it be loaded? Is additional initialization (CPU/MMU status, device initialization, ...) required?
are there kernel parameters to pass? Where should they be put for the kernel to see?
where is the configuration for the bootloader itself stored (hard-coded, files on a partition, ...)? How to load the additional modules? (bootloaders like GRUB are actually small OSes by themselves)
Different bootloaders and OSes may do this stuff differently. The "UNIX-like" bit is not relevant, an OS starts being ostensibly UNIXy (POSIX syscalls, init process, POSIX userland,...) mostly after the kernel starts running.
Even on common x86 PCs the start differs deeply between "traditional BIOS" and UEFI mode (in this last case, the UEFI itself can load and start the kernel, without additional bootloaders being involved).
Coming down to the start of a modern Linux distribution on x86 in BIOS mode with GRUB2, the basic idea is to quickly get up and running a system which can deal with "normal" PC abstractions (disk partitions, files on filesystems, ...), keeping at minimum the code that has to deal with hardcoded disk offsets.
GRUB is not a monolithic program, but it's composed in stages. When booting, the BIOS loads and executes the code stored in the MBR, which is the first stage of GRUB. Since the amount of code that can be stored there is extremely limited (few hundred bytes), all this code does is to act as a trampoline for the next GRUB stage (somehow, it "boots GRUB");
the MBR code contains hard-coded the address of the first sector of the "core image"; this, in turn, contains the code to load the rest of the "core image" from disk (again, hard-coded as a list of disk sectors);
Once the core image is loaded, the ugly work is done, since the GRUB core image normally contains basic file system drivers, so it can load additional configuration and modules from regular files on the boot partition;
Now what happens depends on the configuration of the specific boot entry; for booting Linux, usually there are two files involved: the kernel image and the initrd:
initrd contains the "initial ramdrive", containing the barebones userland mounted as / in the early boot process (before the kernel has mounted the filesystems); it mostly contains device detection helpers, device drivers, filesystem drivers, ... to allow the kernel to be able to load on demand the code needed to mount the "real" root partition;
the kernel image is a (usually compressed) executable image in some format, which contains the actual kernel code; the bootloader extracts it in memory (following some rules), puts the kernel parameters and initrd memory position in some memory location and then jumps to the kernel entrypoint, whence the kernel takes over the boot process;
From there, the "real" Linux boot process starts, which normally involves loading device drivers, starting init, mounting disks and so on.
Again, this is all (x86, BIOS, Linux, GRUB2)-specific; points 1-2 are different on architectures without an MBR, and are are skipped completely if GRUB is loaded straight from UEFI; 1-3 are different/avoided if UEFI (or some other loader) is used to load directly the kernel image. The initrd thing may be not involved if the kernel image already bundles all that is needed to start (typical of embedded images); details of points 4-5 are different for different OSes (although the basic idea is usually similar). And, on embedded machines the kernel may be placed directly at a "magic" location that is automatically mapped in memory and run at start.

Related

Which program take care of loading from Flash to RAM and running program in Microcontroller Bare metal?

The program written in C and compiled on some other IDE/computer (or cross-compiling) and then loaded as binary data into the flash memory of the controller.
What am i not understanding in Bare Metal / No RTOS
Which program/code take care of loading from Flash to RAM?
Is the RAM in microcontroller have intelligence/program to understand binary or at time of compile the intelligence is added to the binary file by compiler?
Ideally your program runs in flash not ram. Many mcus you can, it would be an architecture limit primarily if running from ram is not supported. In a pinch you can run your code in ram if you need a trampoline to reprogram the flash as in downloading new firmware in the field (for a chip with only one flash bank that can't run and be erased/modified at the same time), or for performance, but if you need ram for performance then perhaps you need to rethink your design. small sections sure, but if the whole app has to be in ram for reasons other than development, you need to re-think your system design.
You can easily wrap your program with a small copy to ram bit of code, so that the mcu boots up the copy and jump program and then the main application runs in ram. that is your choice. somewhat trivial just a few lines of code. it is chip/architecture dependent on whether you can handle interrupts in that situation or how you need to design it (more than just a copy and jump for example, might need handlers in flash that hop over to ram too).
There is no magic here, the mcu processor is no different than others you need some non-volatile way to get the program in there. Like most others cpus your processor boots from a rom/flash, then as desired it works toward the final application be it an operating system or not. for an mcu the typical approach is to boot right into the application, run the application in flash for read only items (.text and .rodata) and the read-write in ram (.data, .bss) which is handled by knowing how to use your toolchain, which is a critical part of bare-metal success.
CPUs generally don't care about flash, ram, peripherals, they are just addresses, the cpu is very very dumb. You the programmer are smart you lay the tracks down for the cpu to follow, the instructions have to follow the rules and guide the processor. The processor starts in a well known way at a well known address or vector table, from there it is all on you to keep the processor on track by working within the address space where there are resources, flash, ram and peripherals. The processor may have rules on the address space it can fetch/execute from, or not, depends on the implementation. For implementations where the executable address space has both flash and ram then yes you can simply place code in ram and execute it.
Running code in ram on an mcu is the exception not the rule.
Commonly a microcontroller does not load the (single) program into RAM. Instead it is run "in-place" in the (flash or any other non-volatile) memory. The program is built so that the memory at the (fixed) start address contains the startup code of the program.
Having said that you might wonder how (static) variables are initialized with zero and non-zero values. That is done by the startup code linked in when the program is built.
There is no need to add any "intelligence", assuming you mean something like a byte-code interpreter to execute the binary commands. The CPU of the microcontroller executes the machine code directly. And your compiler generates exactly the machine code.

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU the instructions to run. e.g. I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level what instructions are called in the host process to tell the GPU "here's what you're going to run." What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in detail explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
The enqueue the kernel means ask an OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which will add the dispatch compute workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger HW to process commands recorded into a command queue (which holds all actual commands in the format which particular HW understands). HW might have several queues and process them in parallel. Anyway after the workload from a queue is processed, HW will inform the KMD driver via an interrupt, and KMD is responsible to propagate this update to OpenCL driver via OpenCL supported event mechanism, which allows user to track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get better idea how the OpenCL driver interacts with a HW you could take a look into the opensource implementation, see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c

Confusion as to how fork() and exec() work

Consider the following:
Where I'm getting confused is in the step "child duplicate of parent". If you're running a process such as say, skype, if it forks, is it copying skype, then overwriting that process copy with some other program? Moreover, what if the child process has memory requirements far different from the parent process? Wouldn't assigning the same address space as the parent be a problem?
I feel like I'm thinking about this all wrong, perhaps because I'm imagining the processes to be entire programs in execution rather than some simple instruction like "copy data from X to Y".
All modern Unix implementations use virtual memory. That allows them to get away with not actually copying much when forking. Instead, their memory map contains pointers to the parent's memory until they start modifying it.
When a child process exec's a program, that program is copied into memory (if it wasn't already there) and the process's memory map is updated to point to the new program.
fork(2) is difficult to understand. It is explained a lot, read also fork (system call) wikipage and several chapters of Advanced Linux Programming. Notice that fork does not copy the running program (i.e. the /usr/bin/skype ELF executable file), but it is lazily copying (using copy-on-write techniques - by configuring the MMU) the address space (in virtual memory) of the forking process. Each process has its address space (but might share some segments with some other processes, see mmap(2) and execve(2) ....). Since each process has its own address space, changes in the address space of one process does not (usually) affect the parent process. However, processes may have shared memory but then need to synchronize: see shm_overview(7) & sem_overview(7)...
By definition of fork, just after the fork syscall the parent and child processes have nearly equal state (in particular the address space of the child is a copy of the address space of the parent). The only difference being the return value of fork.
And execve is overwriting the address space and registers of the current process.
Notice that on Linux all processes (with a few exceptions, like kernel started processes such as /sbin/modprobe etc) are obtained by fork-ing -from the initial /sbin/init process of pid 1.
At last, system calls -listed in syscalls(2)- like fork are an elementary operation from the application's point of view, since the real processing is done inside the Linux kernel. Play with strace(1). See also this answer and that one.
A process is often some machine state (registers) + its address space + some kernel state (e.g. file descriptors), etc... (but read about zombie processes).
Take time to follow all the links I gave you.

JVM Monitoring jar execution

I want to check the memory usage of a JAR that does some calculations. For this I want to use JVM monitor. When starting JVM monitor, I need to pick the JVM that is running my jar. But the problem is that my JAR executes so fast (<1sec) that it never shows up in the list..
Is there any way I can start the JVM without executing the JAR immediatly?
JConsole finds running applications at the time when JConsole starts . Then only the currently running applications port and host will be displayed in the list. But for such very short running applications to be shown in the list adding a wait at the end of program execution is the only choice you can do .
Also whatever the memory stats Jconsole will display will include Jconsole's memory footprint as well. So the better choice for monitoring is jvisualvm which can show memory , threads and gc statistics as well .
Alternatevely if you want to check the code cache or compilation statistics you can use -XX:+LogCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintCodeCache

Why is PCD bit set when I don't use ioremap_cache?

I am using ubuntu 12.10 32 bit on an x86 system. I have physical memory(about 32MB,sometimes more) which is enumerated and reserved through the ACPI tables as a device so that linux/OS cannot use it. I have a Linux driver for this memory device. THe driver implements mmap() so that when a process calls mmap(), the driver can map this reserved physical memory to user space. I also sometimes do nothing in the mmap except setup the VMA's and point vma->vmops to the vm_operations_struct with the open close and fault functions implemented. When the application accesses the mmapped memory, I get a page fault and my .fault function is called. Here is use vm_insert_pfn to map the virtual address to any physical address in the 32MB that I want.
Here is the problem I have: In the driver, if I call ioremap_cache() during init, I get good cache performance from the application when I access data in this memory. However, if I don't call ioremap_cache(), I see that any access to these physical pages results in a cache miss and gives horrible performance. I looked into the PTE's and see that the PCD bit for these virtual address->physical translation are set, which means caching on these physical pages is disabled. We tried setting _PAGE_CACHE_WB in the vma_page_prot field and also used remap_pfn_range with the new vma_page_prot but PCD bit was still set in the PTE's.
Does anybody have any idea on how we can ensure caching is enabled for this memory? The reason I don't want to use ioremap_cache() for 32 MB is because there are limited Kernel Virtual Address on 32bit systems and I don't want to hold them.
Suggestions:
Read linux/Documentation/x86/pat.txt
Boot Linux with debugpat
After trying the set_memory_wb() APIs, check /sys/kernel/debug/x86/pat_memtype_list

Resources