how to understand the result of xxd -d dumper? - unix

I have studied many subjects, from physics to logic gates to processors.
I also studied computer architecture, compilers, x86 assembly, operating systems, GPUs, and so on.
For some reason, none of the subjects mentioned above covered what happens after an executable file is produced by the compiler, all the way down to the processor.
Could you please point me to resources that explain these things? It drives me crazy when I don't understand why things work the way they do.
For example, I want to understand why UNIX executables start with 'ELF'. If you tell me it's a convention, then how does the computer, as a machine, recognize that a file passed to it starts with 'ELF'?
You might say that is the job of the operating system. Then how does the computer understand the code of the operating system? I know the processor reads it represented in binary.
But how does the computer really understand the binary? Don't just say 'via transistors and logic gates'. What I need to understand is how the binary is signaled to the computer hardware, and how the hardware is actually implemented so that it understands the binary.
Please share any resources about this stage with me.

Binary is really just electrical signals and small charges stored in memory circuits. The binary data (e.g. 0101) is just a number represented with 2 symbols instead of 10 (as in base 10, or decimal). What matters is the way you interpret those numbers to give them meaning.
For example, in processor architectures like x64, the developers of the architecture create an instruction set that has a certain number of instructions with a very specific format. The binary data is then interpreted as those instructions. Since the CPU's logic circuits are manufactured to recognize those instructions, it can execute them by doing whatever the architecture says they should do.
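As a tiny demonstration that machine code really is just bytes, here is a sketch for x86-64 Linux (it may be refused on systems that forbid memory that is both writable and executable): the bytes B8 2A 00 00 00 C3 encode mov eax, 42 followed by ret, and the CPU will execute them from any memory it is allowed to execute from.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* x86-64 machine code for: mov eax, 42 ; ret */
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    /* put the bytes into a page the CPU is allowed to execute */
    void *p = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    memcpy(p, code, sizeof code);

    /* treat that address as a function and call it */
    int (*fn)(void) = (int (*)(void))p;
    printf("%d\n", fn());   /* prints 42 */
    return 0;
}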
The CPU is also configured to start executing at a specific address. The motherboard manufacturer puts machine code in binary at that address to kickstart the CPU. This is called the firmware. The OS developer will put its machine code on the hard-drive at a conventional position and in a conventional format so that the firmware can find and execute it.
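That "conventional format" is also where the 'ELF' from the question comes in: the loader in the operating system simply reads the first bytes of the file and compares them to the magic number the ELF specification defines, namely the byte 0x7F followed by the characters 'E', 'L', 'F'. A minimal sketch of that check:

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 2;

    FILE *f = fopen(argv[1], "rb");
    unsigned char magic[4];
    if (!f || fread(magic, 1, 4, f) != 4) return 1;
    fclose(f);

    /* the ELF specification fixes the first four bytes of every ELF file */
    if (magic[0] == 0x7F && magic[1] == 'E' && magic[2] == 'L' && magic[3] == 'F')
        printf("%s looks like an ELF file\n", argv[1]);
    else
        printf("%s is not an ELF file\n", argv[1]);
    return 0;
}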
Even in electrical terms there needs to be a conventional interpretation of signals. A certain chip on your motherboard might interpret 3V as a high signal and another might use 5V. What matters is that interconnected chips understand each other's interface. For example, an x64 processor might be connected to modern DDR4 RAM. The CPU knows that if it applies the clock and command signals, at the voltages and frequencies the standard specifies, to certain pins of that module, then each clock cycle will write or read data in the memory cells that are selected. It knows that because it expects the RAM module to follow the DDR4 standard. If the RAM module doesn't follow the standard then the CPU will not be able to interact with it.

Related

Which program takes care of loading from flash to RAM and running the program on a bare-metal microcontroller?

The program is written in C and compiled on some other computer/IDE (i.e. cross-compiled) and then loaded as binary data into the flash memory of the controller.
What am I not understanding about bare metal / no RTOS?
Which program/code takes care of loading from flash to RAM?
Does the RAM in the microcontroller have intelligence/a program to understand the binary, or is the intelligence added to the binary file by the compiler at compile time?
Ideally your program runs in flash, not RAM. On many MCUs you can run from RAM; it is primarily an architectural limit if running from RAM is not supported. In a pinch you can run your code in RAM if you need a trampoline to reprogram the flash, as when downloading new firmware in the field (on a chip with only one flash bank, which can't execute and be erased/modified at the same time), or for performance. But if you need RAM for performance then perhaps you need to rethink your design: small sections, sure, but if the whole application has to be in RAM for reasons other than development, you need to rethink your system design.
You can easily wrap your program with a small copy-to-RAM piece of code, so that the MCU boots into the copy-and-jump program and the main application then runs in RAM; that is your choice, and it is somewhat trivial, just a few lines of code (see the sketch below). Whether you can handle interrupts in that situation, and how you need to design for it, is chip/architecture dependent (it may take more than just a copy and jump; for example, you might need handlers in flash that hop over to RAM too).
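In the spirit of that copy-and-jump, a rough sketch in C of what the shim could look like (the addresses and APP_SIZE are made up, and chip-specific details such as relocating the vector table or reloading the stack pointer are deliberately left out):

#include <stdint.h>
#include <string.h>

#define APP_LOAD_ADDR  0x08004000u  /* hypothetical location of the image in flash */
#define APP_RUN_ADDR   0x20000000u  /* hypothetical RAM address the image was linked for */
#define APP_SIZE       0x4000u      /* hypothetical image size in bytes */

void boot_copy_and_jump(void)
{
    /* copy the application image from flash into RAM */
    memcpy((void *)(uintptr_t)APP_RUN_ADDR,
           (const void *)(uintptr_t)APP_LOAD_ADDR, APP_SIZE);

    /* jump to its entry point; on some cores (e.g. Cortex-M/Thumb) the low
       bit of the address must be set, which is omitted here for simplicity */
    void (*app_entry)(void) = (void (*)(void))(uintptr_t)APP_RUN_ADDR;
    app_entry();
}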
There is no magic here; the MCU's processor is no different from others, and you need some non-volatile way to get the program in there. Like most other CPUs, your processor boots from ROM/flash, then works its way toward the final application as desired, be it an operating system or not. For an MCU the typical approach is to boot straight into the application, keep the read-only items (.text and .rodata) in flash and run them in place, and put the read-write items (.data, .bss) in RAM. That is handled by knowing how to use your toolchain, which is a critical part of bare-metal success.
CPUs generally don't care about flash, RAM or peripherals; they are just addresses, and the CPU is very, very dumb. You, the programmer, are smart: you lay the tracks down for the CPU to follow, and the instructions have to follow the rules and guide the processor. The processor starts in a well-known way at a well-known address or vector table (see the sketch below); from there it is all on you to keep the processor on track by working within the parts of the address space where there are resources: flash, RAM and peripherals. The processor may or may not have rules about the address space it can fetch/execute from; that depends on the implementation. For implementations where the executable address space has both flash and RAM, then yes, you can simply place code in RAM and execute it.
Running code in RAM on an MCU is the exception, not the rule.
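To illustrate "starts in a well-known way at a well-known address or vector table", this is roughly what a minimal Cortex-M style vector table looks like in C (assuming a Cortex-M target; the .isr_vector section name and the _estack symbol are conventions that have to match your linker script, so treat them as assumptions):

/* first two entries of a Cortex-M vector table: initial stack pointer
   and the reset vector, i.e. the first code the CPU ever executes */
extern unsigned long _estack;           /* top of stack, from the linker script */
void Reset_Handler(void);

__attribute__((section(".isr_vector")))
void (* const vector_table[])(void) = {
    (void (*)(void))&_estack,   /* word 0: initial stack pointer value */
    Reset_Handler,              /* word 1: reset handler address */
};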
Commonly a microcontroller does not load the (single) program into RAM. Instead it is run "in-place" in the (flash or any other non-volatile) memory. The program is built so that the memory at the (fixed) start address contains the startup code of the program.
Having said that, you might wonder how (static) variables are initialized with zero and non-zero values. That is done by the startup code linked in when the program is built.
There is no need to add any "intelligence", assuming you mean something like a byte-code interpreter to execute the binary commands. The CPU of the microcontroller executes the machine code directly, and your compiler generates exactly that machine code.
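To make the startup code mentioned above a bit more concrete, this is roughly what it does before calling main (a sketch; the _sidata/_sdata/_edata/_sbss/_ebss symbols are conventional linker-script names and are assumed here):

#include <stdint.h>

/* symbols typically provided by the linker script (names are an assumption) */
extern uint32_t _sidata;  /* start of the .data initial values, stored in flash */
extern uint32_t _sdata;   /* start of .data in RAM */
extern uint32_t _edata;   /* end of .data in RAM */
extern uint32_t _sbss;    /* start of .bss in RAM */
extern uint32_t _ebss;    /* end of .bss in RAM */

extern int main(void);

void Reset_Handler(void)
{
    /* copy initialized data from flash to its RAM addresses */
    uint32_t *src = &_sidata;
    for (uint32_t *dst = &_sdata; dst < &_edata; )
        *dst++ = *src++;

    /* zero out the .bss section */
    for (uint32_t *dst = &_sbss; dst < &_ebss; )
        *dst++ = 0;

    main();
    for (;;) { }  /* spin if main ever returns */
}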

How does Open MPI implement datatype conversion?

The MPI standard states that when parallel programs run in a heterogeneous environment, they may have different representations for the same datatype (like big-endian and little-endian machines for integers), so datatype representation conversion might be needed when doing point-to-point communication. I don't know how Open MPI implements this.
For instance, current Open MPI uses the UCX library by default. I have studied some code of the UCX library and Open MPI's ucx module. However, for a contiguous datatype like MPI_INT, I didn't find any representation conversion happening. I wonder: is it because I missed that part, or does the implementation not satisfy the standard?
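For reference, the kind of point-to-point exchange I have in mind is nothing special; a minimal two-rank example (just an illustration, not code from Open MPI or UCX):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* if rank 0 is little endian and rank 1 is big endian, someone has
           to convert the representation for the value to arrive as 42 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}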
If you want to run an Open MPI app on a heterogeneous cluster, you have to configure with --enable-heterogeneous (this is disabled by default). Keep in mind this is supposed to work, but it is lightly tested, mainly because of a lack of interest/real use cases. FWIW, IBM Power is now little endian, and Fujitsu is moving from SPARC to ARM for HPC, so virtually all HPC processors are (or will soon be) little endian.
Open MPI uses convertors (see opal/datatype/opal_convertor.h) to pack the data before sending it, and unpack it once received.
The data is packed in its current endianness. Data conversion (e.g. swap bytes) is performed by the receiver if the sender has a different endianness.
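Conceptually, that receiver-side conversion boils down to a byte swap of each element. This is not Open MPI's actual convertor code, just a sketch of the idea for 32-bit integers:

#include <stddef.h>
#include <stdint.h>

/* hypothetical helper: swap a buffer of 32-bit integers in place when the
   sender's endianness differs from the receiver's */
static void swap_int32_buffer(int32_t *buf, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t v = (uint32_t)buf[i];
        buf[i] = (int32_t)((v >> 24) |
                           ((v >> 8) & 0x0000FF00u) |
                           ((v << 8) & 0x00FF0000u) |
                           (v << 24));
    }
}

/* after receiving 'count' MPI_INT elements:
   if (sender_is_big_endian != receiver_is_big_endian)
       swap_int32_buffer(recvbuf, count);
*/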
There are two ways of using UCX: pml/ucx and pml/ob1 + btl/ucx, and I have tested neither of them in a heterogeneous environment. If you are facing some issues with pml/ucx, try mpirun --mca pml ob1 ....

OpenCL "cross"-compile x64 / 32-bit-pointer GPU

I'm trying to optimize my kernel functions and ran into a bit of an issue. First, this may be Radeon R9 (Hawaii) related, but it should happen for other GPU devices as well.
For the host I have two platform options: either compile and run as an x86 program, or as an x64 program. Depending on which platform I choose, I get differently compiled kernels: one that uses 32-bit pointers and pointer arithmetic, and another that uses 64-bit pointers. The generated IL code shows the difference; in the first case it is
prog kernel &__OpenCL_execute_kernel(
kernarg_u32 %_.global_offset_0,
kernarg_u32 %_.global_offset_1,
...
and in the second case it is:
prog kernel &__OpenCL_execute_kernel(
kernarg_u64 %_.global_offset_0,
kernarg_u64 %_.global_offset_1,
...
64-bit arithmetic on a GPU is rather expensive and consumes a lot of additional VGPRs. In my case, the 64-bit-pointer version requires 8 more VGPRs and has about 140 more VALUInsts, as shown by CodeXL. Overall performance is about 37% worse for the slower 64-bit kernel than for the faster 32-bit kernel, which is, other than the internal pointer arithmetic, completely identical. I have tried to optimize this, but even with plain offsets I'm still stuck with a lot of ADD_U64 IL instructions, which in ISA code produce two instructions each: V_ADD_I32 and V_ADDC_U32. And of course all pointers require twice the private memory space (hence more VGPRs).
Now my question is: is there a way to "cross"-compile an OpenCL kernel so that an x64 program can create a 32-bit-pointer kernel? I don't need to address that much memory on the GPU, so addressing less than 4 GiB of memory space is fine. As my host also executes AVX-512 instructions using all 32 zmm registers, which are only available in x64 mode, an x86 program is not an option. That makes the whole situation a bit challenging.
Well, my fallback solution is to spawn an x86 child process that uses shared memory and acts as a compiling gate. But I'd rather not do that if a simple flag or (AMD-specific) setting in OpenCL does the trick.
Please don't reply with a why-that-is response; I'm completely aware of why the x64 program and kernel behave that way.
I have a couple of ideas, but not being familiar with the guts of the AMD GPU OpenCL implementation, I am stabbing in the dark.
Can you pass the data in via an image (even if it isn't one)? On Intel GPUs, going through the sampler provides a different path and can avoid 64-bit arithmetic even in the 64-bit version.
Does AMD have an extension that allows you to do block reads and writes? This can help if the compiler proves that the address is uniform (scalar), e.g. something like Intel subgroups (which enable some block I/O). On Intel this helps avoid shipping a SIMD's worth of addresses across the bus for a scatter/gather (and saves register space too).
(This is a stretch.) Does compiling for OpenCL 1.2 or lower help? That is, specify -cl-std=CL1.2. If the compiler knows that SVM is not being used (that requires OpenCL 2.0 or later) and were to run a conservative analysis on the program to prove that it's not doing anything wild with pointer arithmetic, it could feasibly do the arithmetic in 32 bits and implicitly add a 64-bit relative offset to all addresses (making the GPU program think it's using 32-bit addresses).
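If you want to try that, the flag just goes into the build-options string; a minimal sketch (program and device creation omitted, and no guarantee that the AMD compiler actually switches to 32-bit pointers because of it):

#include <CL/cl.h>
#include <stdio.h>

/* assumes 'program' and 'device' were created elsewhere */
static void build_for_cl12(cl_program program, cl_device_id device)
{
    cl_int err = clBuildProgram(program, 1, &device, "-cl-std=CL1.2", NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[4096];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof log, log, NULL);
        fprintf(stderr, "build failed:\n%s\n", log);
    }
}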
Again, I know nothing about AMD specifics, but I feel your pain with this problem.

what is the fastest way to communicate with a serial port?

I would like to know which programming language gives me the fastest response when communicating with a serial port on the Windows platform.
Asm?
java?
vb
C#
.NET
BASH
etc.
Serial ports date from the stone-age of computing. That's where you plugged-in your ASR-33 teletype to start banging in your Fortran program. And wait a couple of hours to get your program compiled.
At a common baud rate of 9600, it takes about a millisecond to transfer one byte (with typical 8-N-1 framing that's 10 bits per byte, so 9600 / 10 = 960 bytes per second). A modern processor easily executes at least 3 million instructions in that time span. Even the most mundane interpreted scripting language has no trouble keeping up with that. Such a processor also doesn't have a problem keeping up with a 10 gigabit/second network card, common enough for Ethernet and way out of reach for serial ports.
The real problem with serial ports is that they are too slow. You cannot afford to wait for them; that would make your program too slow and unresponsive, even for a human. So pick a language that makes asynchronous programming easy, which can be very tricky to get right. Any .NET language is certainly a good candidate.
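To give an idea of what asynchronous I/O looks like at the lowest level on Windows, here is a sketch in plain C using overlapped Win32 I/O (COM1 and the buffer size are arbitrary, and the SetCommState/SetCommTimeouts configuration of baud rate and timeouts is left out):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* open the port for overlapped (asynchronous) I/O */
    HANDLE h = CreateFileA("\\\\.\\COM1", GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    OVERLAPPED ov = {0};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    char buf[64];
    DWORD got = 0;

    /* start the read; it usually completes later, signaled via ov.hEvent */
    if (!ReadFile(h, buf, sizeof buf, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        return 1;  /* a real error, not just "still pending" */
    }

    /* ... the program is free to do other work here ... */

    /* wait for completion and fetch the byte count */
    GetOverlappedResult(h, &ov, &got, TRUE);
    printf("read %lu bytes\n", (unsigned long)got);

    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return 0;
}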

Microcontroller + Verilog/VHDL simulator?

Over the years I've worked on a number of microcontroller-based projects; mostly with Microchip's PICs. I've used various microcontroller simulators, and while they can be very helpful at times, I often find myself frustrated. In real life microcontrollers never exist alone and the firmware's behavior is dependent on the environment. However, none of the sims I've used provide decent support for anything outside the microcontroller.
My first thought was to model the entire board in Verilog. But, I'd rather not create an entire CPU model, and I haven't had much luck finding existing models for the chips I use. Regardless, I really don't need, or want, to simulate the proc at that level of detail, and I'd like to retain the debugging facilities provided by a regular processor sim.
It seems to me that the ideal solution would be a hybrid simulator that interfaces a traditional processor simulator with a Verilog model.
Does such a thing exist?
I've used the Altera Nios II processor embedded in an FPGA. Altera provides a toolchain for simulating the CPU (with its software) together with your custom logic in a simulator. I suppose a similar setup can be achieved by downloading a VHDL/Verilog core of your CPU (did you try OpenCores? They have lots of stuff there).
But keep in mind that it is going to be mind-bogglingly slow, so don't expect to simulate whole complex processes this way. The best you can hope for is simulating fine-grained software-hardware interaction points to debug problems. If you need a deeper simulation, consider running it on an FPGA with built-in monitoring code.
For the "simulate the whole board" approach,
The Free Model Foundry has a large number of models, some in VHDL and others in Verilog, that are available now, but you'll need to pay to have new models created. These are very helpful for making sure the board is built correctly.
But I think the more common approach when debugging your PIC is to just build a board, then work on the firmware. In the chip world (where the firmware is running on a microprocessor in a chip that hasn't gone to fab yet), people often resort to very expensive systems (or renting time on them) that allow compiling part of the design into an emulator while the rest of the design runs in the normal simulator environment. Without the barrier of an expensive mask set for the chip, the cost is just not justifiable for a circuit board. I've heard of some creative applications of Simulink (MathWorks) with FPGAs, but my recollection is that one either ran the system on the computer, or programmed the device and ran the same thing in real time.
I believe both Cadence (ask about Palladium) and Mentor Graphics have that integrated solution if you have the money to spend on it.
What I have done recently is create an interface between the simulation environment and the host system. Different HDL simulators have different interfaces, and getting the simulator to NOT think in batch mode (the traditional simulation model) but instead run forever like a real design is half of the problem.
Then from the host, using C (or whatever), you can create abstractions that may or may not allow you to write your application software for whatever target (depending on what language and compiler capabilities you have). For example, you can make generic poke and peek functions: on the final target they actually poke and peek memory or I/O, but for simulation, behind the abstraction, they talk to a testbench in the simulation that performs the same memory cycle (see the sketch below).
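A bare-bones sketch of such an abstraction (the sim_bus_write/sim_bus_read names are made up; in reality they would be the socket or PLI glue into the testbench):

#include <stdint.h>

#ifdef SIMULATION
/* hypothetical glue functions that forward the access to the HDL testbench */
extern void     sim_bus_write(uint32_t addr, uint32_t data);
extern uint32_t sim_bus_read(uint32_t addr);

void     poke(uint32_t addr, uint32_t data) { sim_bus_write(addr, data); }
uint32_t peek(uint32_t addr)                { return sim_bus_read(addr); }
#else
/* on the real target, poke/peek are just memory-mapped accesses */
void poke(uint32_t addr, uint32_t data)
{
    *(volatile uint32_t *)(uintptr_t)addr = data;
}
uint32_t peek(uint32_t addr)
{
    return *(volatile uint32_t *)(uintptr_t)addr;
}
#endif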
I went one step further and used (Berkeley) sockets between the host and the testbench, so that the simulation can keep running while the host applications stop and start. Not unlike having a real processor with an OS, where you start applications, run them to completion, and start another. That matters mostly for test applications; for delivery you probably only have one app.
By creating these abstraction layers I can write real applications that will be used on the target when it is built. Along the way you can use software simulation of the logic initially, then if you like build an FPGA with a throw-away abstraction interface, say a UART for example: replace the shim between the application's abstraction layer and the simulator with a UART interface, or whatever. Then, when you marry the processor and logic in the same chip or on the same board, replace the abstraction layer again with direct calls to whatever interfaces the applications always thought they were talking to. If something breaks and you have retained the abstraction layer, you can take the application back to the simulation model and have access to all of your logic internals.
Specifically, this time around I am using an HDL language called Cyclicity CDL, which is on SourceForge. The documentation needs some help, but the examples may get you going, and it produces synthesizable Verilog, so you get an extra win there. I threw out all the scripting batch stuff other than the bare minimum needed to connect and start a C simulation model. So my testbench is in C (well, C++ technically) and the sockets layer was done there. The output can be .vcd files, which gtkwave uses. Basically you can do the bulk of your HDL design using open source software with no licenses, etc. By adding one or two lines of code to the CDL simulation part I was able to have it run as an infinite loop, which I can say works quite well; there doesn't appear to be any memory leak, etc.
Both ModelSim and Cadence have standardized ways of connecting host C programs to the simulation world, and from there you can use an IPC mechanism to reach host applications talking to an abstraction-layer API.
This is probably way overkill for a PIC; I gave up on PICs a while ago for the faster and more C-friendly ARM-based micros anyway. There is (or was) an open-core PIC that you could simply incorporate into your simulation, even though that is not what you are trying to do here.
Not that I've seen. Your best bet is to properly define the interfaces and behavior between the uC and the FPGA, and then define a series of test waveforms that can be applied using an automated tester. You would have to build the automated tester out of an FPGA or uC (apply a waveform; watch interrupts, breakpoints, etc.), or perhaps a logic analyzer has some such functionality. If you really want, OpenCores.org has PIC- and AVR-like 8-bit uC cores defined in VHDL, so you could implement your entire project on the FPGA and then just debug that.
Generally there is no need to model the CPU at the RTL level, since you don't really care what it does bit by bit; you generally care about what it does, e.g. register values, memories and bus accesses.
The simplest is called a Bus Functional Model. This just generates the reads and writes that the CPU does, often based on a text file. These are available for some CPUs and many popular buses (e.g. PCI, PCIe). These simulate super fast.
The next step up is a functional cycle-accurate model. Those simulate fast. They are often encrypted.
Last is a full RTL model. Those usually are only available if you are working closely with the CPU vendor, e.g. using their core in your ASIC. Typically these are encrypted, unless you are a huge company.
Memory models are typically cycle-accurate (e.g. Micron).
My workmates from the hardware department use FPGA simulation software quite often to find timing-bugs and trace down strange behaviours.
Simulating one or two milliseconds can take several hours, so using the simulator for anything but very small things is not feasible.
You may want to have a look at SystemC though. http://en.wikipedia.org/wiki/SystemC
