Understanding low-level abstraction

I have started programming in Java this year. I understand the high-level concepts and feel comfortable programming.
However, I keep asking myself: how does all of this work internally? I understand that Java is a high-level language made specifically to keep the programmer away from low-level details and to simplify development.
In essence, I would like to know more about how exactly high-level languages function internally (e.g. object-oriented programming). It's clear to me why they are used, but not how everything works internally (memory allocation etc.). How are objects represented internally, and so on?
Can someone point me in the right direction with some keywords, or preferably refer me to some material? Would learning a low-level language like C or C++ help this learning process?

Based on the wording of your question, your low-level is still very high level.
Object oriented has nothing to do with how high or low level the language is; it just means oriented on objects, and you can have object-oriented assembly. It is not a language thing: basically any language can be used in an object-oriented way.
Memory allocation is specific to the operating system and/or whoever is managing the memory. Nothing complicated there really at a high level. I have a pizza and 3 people; I can cut that pizza up into 3 slices or 4 or 8 or whatever, each person can allocate one slice and there are some left over, and they can come back and allocate more. Now freeing that pizza allocation after consumption is not something we want to visualize, but the idea is the same: you have some memory you want to allow a program to borrow/take. You divide it up; it doesn't have to be all even sizes. You might offer various sizes, 1K, 2K, 4K, 8K ... 1 Meg units, etc., and multiples of those. You create a table/chart of who has consumed what and what is left free. When they give it back you mark it free. Old school linear thinking can make this hard, but MMUs (memory management units) make this easy. And that is low or lower level thinking. They are address translators along with protection features to prevent programs from accessing memory that isn't theirs.
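Going back to the allocation table for a second, here is a minimal sketch of that book-keeping in C (fixed-size slices only; all names and sizes are invented for illustration):
#include <stddef.h>
#define BLOCK_SIZE 1024           /* every slice is 1K in this toy allocator */
#define NUM_BLOCKS 16
static unsigned char pool[NUM_BLOCKS][BLOCK_SIZE];  /* the pizza */
static int used[NUM_BLOCKS];                        /* the table of who took what */
/* hand out one free block, or NULL if everything is taken */
void *toy_alloc(void)
{
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (!used[i]) {
            used[i] = 1;
            return pool[i];
        }
    }
    return NULL;
}
/* give a block back: just mark it free in the table */
void toy_free(void *p)
{
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (p == pool[i]) {
            used[i] = 0;
            return;
        }
    }
}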
An easy way to see what an MMU does for you, from a memory allocation perspective, is to think of all the free-to-borrow/take memory as being in units of 0x1000 bytes. Say it starts at address 0x10000, so 0x10000, 0x11000, 0x12000 and so on. That is the physical address, the actual memory side. But we can have a virtual address space as well. I may ask for 0x3000 bytes and may be given a pointer 0x20000000. When I access between 0x20000000 and 0x20000FFF, the MMU may translate that virtual address into physical addresses 0x00007000 to 0x00007FFF. But 0x20001000 to 0x20001FFF may translate to physical 0x00004000 to 0x00004FFF. And naturally 0x20002000 to some other physical address. So if someone allocates 10 blocks and another allocates 3, the software that manages that allocation can give the first 10 physical blocks to the first program and the next 3 after that to the next. If the first frees and then someone allocates 7, the first 7 physical blocks can be given to that new someone, giving us a map of the first 7 used, 3 free, and 3 used in a physical linear view. If someone now allocates 4 we can actually give them the 3 free ones and another one at the end, because we can map them in virtual space so they feel like they are accessing them linearly.
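A toy model of that translation in C (the table contents are invented just to mirror the numbers above; a real MMU does this lookup in hardware, usually through multi-level page tables):
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#define PAGE_SIZE 0x1000u    /* the 0x1000-byte blocks from the example */
struct mapping { uint32_t virt_base; uint32_t phys_base; };
static const struct mapping table[] = {
    { 0x20000000u, 0x00007000u },
    { 0x20001000u, 0x00004000u },
    { 0x20002000u, 0x00009000u },   /* "some other physical address" */
};
/* translate a virtual address by table lookup, the way an MMU would in hardware */
uint32_t translate(uint32_t vaddr)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (vaddr - table[i].virt_base < PAGE_SIZE)
            return table[i].phys_base + (vaddr & (PAGE_SIZE - 1u));
    }
    return 0;   /* no mapping: a real MMU would raise a fault here */
}
int main(void)
{
    printf("0x20000123 -> 0x%08x\n", (unsigned)translate(0x20000123u)); /* 0x00007123 */
    printf("0x20001FFF -> 0x%08x\n", (unsigned)translate(0x20001FFFu)); /* 0x00004FFF */
    return 0;
}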
If I have a list of students listed alphabetically, that doesn't mean their dorm room numbers match linearly. Alphabetically, student number 1 on the list doesn't have to live in dorm room number 1. I have a table that maps their name to their dorm room. If we add a student in the middle of the list alphabetically, that doesn't mean we have to shuffle all the dorm room numbers; we just need a table. So someone can be given 5 names out of the alphabetical list to work on a project, and that doesn't mean they are in 5 adjacent dorm rooms; when we need to talk to each of those five students we can use the table of name to dorm room to find them. The virtual address is the alphabetical list, the physical address is the dorm room those folks live in. Manage the tables and a program can access what it thinks is a linear memory space, but is really just fragments spread about. You don't have to "defrag" memory as it is allocated and freed. Without an MMU, it gets very messy.
Low level stuff that a high level language avoids is the nuances of the processor. I can go through the drive-through and order a burger, or I can go buy buns, meat, pickles, tomatoes, lettuce, ketchup, etc. and then cook and assemble a burger myself. a = b + c in a high level language can end up being a number of memory and/or register accesses: save one or more registers to the stack to free up registers, gather up those values from wherever they are stored in memory (if not already in said registers), perform the operation, then now or later save the result to memory as needed. System calls like printing or file access or network or video, etc., are tons of code doing small individual tasks to make the whole. All the bricks and boards and nails and cement and such that it takes to make a building: like the burger, I can just buy a house that someone (the compiler) built, or I can buy five zillion tools and materials and construct that house myself, shaping and combining those materials in the right order.
The high level language gives you abstraction as well. This is C but I bet you can understand it.
unsigned int fun ( unsigned int a, unsigned int b )
{
    return (a + b + 7);
}
I can compile it into its pickles and lettuce and bun ingredients along with the knives and frying pans that put it all together:
00000000 <fun>:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e24dd00c sub sp, sp, #12
c: e50b0008 str r0, [fp, #-8]
10: e50b100c str r1, [fp, #-12]
14: e51b2008 ldr r2, [fp, #-8]
18: e51b300c ldr r3, [fp, #-12]
1c: e0823003 add r3, r2, r3
20: e2833007 add r3, r3, #7
24: e1a00003 mov r0, r3
28: e24bd000 sub sp, fp, #0
2c: e49db004 pop {fp} ; (ldr fp, [sp], #4)
30: e12fff1e bx lr
I can be a lot more efficient, McDonald's instead of a greasy spoon diner:
00000000 <fun>:
0: e2811007 add r1, r1, #7
4: e0810000 add r0, r1, r0
8: e12fff1e bx lr
Or I can use the same code on a completely different computer:
00000000 <_fun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d40 0006 mov 6(r5), r0
8: 65c0 0007 add $7, r0
c: 6d40 0004 add 4(r5), r0
10: 1585 mov (sp)+, r5
12: 0087 rts pc
And yes, with the right tools (GNU works just fine) you can easily take C/C++ and start to see the above and try to understand what the language is doing for you. When it comes to system calls like printf or file access, etc., the application calls library functions, which are other code linked in, and those eventually ask the operating system to go do that task (like using your credit card to buy that burger rather than cash: the cashier now has to swipe the card in a box, the box talks to banks somewhere in the world to please do this transaction, rather than opening a drawer and the cashier taking care of it). Adding a couple of numbers usually doesn't involve the operating system, but to access a controlled or complicated or shared resource like video or disk, etc., you have to ask the operating system to do that for you, and that is language, compiler and operating system specific.
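A rough illustration of that layering, assuming a POSIX-style system (printf is the library code linked into your program; write is the thin wrapper around the request to the operating system underneath it):
#include <stdio.h>      /* C library: buffered, formatted I/O */
#include <unistd.h>     /* POSIX: the thin wrapper around the OS system call */
int main(void)
{
    /* library call: printf formats the text, buffers it, and eventually
       calls write() itself on your behalf */
    printf("hello via the C library\n");
    /* (almost) direct request to the operating system: file descriptor 1
       is stdout, and the kernel does the actual work of getting the bytes out */
    write(1, "hello via a system call\n", 24);
    return 0;
}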
Java and Python (early Pascal, etc.) abstract that by compiling to a machine code that is not actually implemented, nor implementable, in hardware directly, then having a platform- and operating-system-specific virtual machine (written in some other language like C) that reads those Java bytecodes and performs the task, some of the tasks being push b, push c, add (giving a), and some being go read a file. It is possible to disassemble and see what Java is producing at the bytecode level, but it is easier to do with compiled languages.
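To make the "virtual machine written in C" idea concrete, here is a toy interpreter loop with made-up opcodes (this is not real JVM bytecode, just the push b, push c, add idea in miniature):
#include <stdio.h>
enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };   /* invented opcodes */
int main(void)
{
    /* "bytecode" for: push 3, push 4, add, print, halt */
    int program[] = { OP_PUSH, 3, OP_PUSH, 4, OP_ADD, OP_PRINT, OP_HALT };
    int stack[16];
    int sp = 0;     /* stack pointer */
    int pc = 0;     /* program counter into the bytecode */
    for (;;) {
        switch (program[pc++]) {
        case OP_PUSH:  stack[sp++] = program[pc++]; break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
        case OP_HALT:  return 0;
        }
    }
}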
As in javiergarval's answer, the Tanenbaum book(s) or ones like them may cover what you are after initially: the middle layer, the operating system. But depending on how low you want to go, it gets down into assembly language and then further down into logic and busses.
You might consider the book
Code: The Hidden Language of Computer Hardware and Software
by Petzold, to come from the other direction.

A good book I always recommend is Modern Operating Systems by Andrew S. Tanenbaum.
It covers all the how-tos you wonder about when programming.
Also see this thread from SO.

Related

Why are the Motorola 68k's 32-bit general-purpose registers divided into data registers and address registers?

The 68k registers are divided into two groups of eight: eight data registers (D0 to D7) and eight address registers (A0 to A7). What is the purpose of this separation? Would it not be better if they were united?
The short answer is, this separation comes from the architecture limitations and design decisions made at the time.
The long answer:
The M68K implements quite a lot of addressing modes (especially when compared with RISC-based processors), with many of its instructions supporting most (if not all) of them. This gives a large variety of addressing-mode combinations within every instruction.
This also adds complexity in terms of opcode execution. Take the following example:
move.l $10(pc), -$20(a0,d0.l)
The instruction just copies a long-word from one location to another, simple enough. But in order to actually perform the operation, the processor needs to figure out the actual (raw) memory addresses to work with for both the source and destination operands. This process, in which the operands' addressing modes are decoded (resolved), is called effective address calculation.
For this example:
In order to calculate the source effective address, $10(pc), the processor loads the value of the PC (program counter) register and adds $10 to it.
In order to calculate the destination effective address, -$20(a0,d0.l), the processor loads the value of the A0 register, adds the value of the D0 register to it, then subtracts $20.
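Expressed in C, with the register names standing for their current values (purely illustrative; this is not how the silicon does it):
#include <stdint.h>
/* effective addresses for: move.l $10(pc), -$20(a0,d0.l) */
uint32_t src_ea(uint32_t pc)
{
    return pc + 0x10;          /* pc-relative with displacement */
}
uint32_t dst_ea(uint32_t a0, uint32_t d0)
{
    return a0 + d0 - 0x20;     /* address register + index register + displacement */
}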
This is quite a lot of calculations of a single opcode, isn't it?
But the M68K is quite fast in performing these calculations. In order to calculate effective addresses quickly, it implements a dedicated Address Unit (AU).
As a general rule, operations on data registers are handled by the ALU (Arithmetic Logical Unit) and operations involving address calculations are handled by the AU (Address Unit).
The AU is well optimized for 32-bit address operations: it performs 32-bit subtraction/addition within one bus cycle (4 CPU ticks), which the ALU doesn't (it takes 2 bus cycles for 32-bit operations).
However, the AU is limited to just load and basic addition/subtraction operations (as dictated by the addressing modes), and it's not connected to the CCR (Conditional Codes Register), which is why operations on address registers never update flags.
That said, the AU should've been there to optimize calculation of complex addressing modes, but it just couldn't replace the ALU completely (after all, there were only about 68K transistors in the M68K), hence there are two register sets (data and address registers), each having their own dedicated unit.
So this is just based on a quick lookup, but using 16 general registers would obviously be easier to program. The problem is that you would then have to encode any of the 16 registers in each operand field, which would double the number of opcode encodings needed. Using half for each purpose is not ideal, but it gives access to more registers in general.

What is the difference when programming 8-bit, 16-bit and 32-bit microcontrollers?

People always want an engineer who can program / has experience with 8-, 16- and 32-bit controllers. I still cannot figure out whether there is a huge difference when we program 8-, 16- and 32-bit microcontrollers, or whether it's just different in terms of declaring variables ...
Similar to Weather Vane's comment. Is there a difference driving a wee bitty compact car, a pickup truck, a moving van and a semi truck (tractor-trailer)? Well, in a lot of respects they are the same: gas, brakes, steering wheel, seat, door, window, radio, headlights, all that stuff; you brake and gas and steer. Could you move your house with a Smart car? Well, not everything, although you could maybe get a small trailer for the bed and couch, but you could move a box or two at a time and make a ton of trips, the pickup fewer trips, the semi maybe one trip, depending on how much stuff you have.
If you are programming at a high level like C, then initially it will feel the same: gas, brake, steer. But it is like taking many trips to move your stuff, or a few, or one. You can in general do 64-bit math on an 8-bit processor, or a 16-bit one, and we know you can on a 32-bit one because we do that all the time, and naturally on a 64-bit one. It just takes more trips: you have to break the math down into parts and do the parts one at a time. And naturally, like using your Fiat to move the stuff in your apartment, it takes a lot more time than using a truck.
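As a sketch of what "more trips" means, here is a 32-bit addition done one 8-bit piece at a time in C, which is roughly what a compiler for an 8-bit part has to emit for you:
#include <stdint.h>
#include <stdio.h>
/* add two 32-bit values using only 8-bit additions plus a carry,
   the way an 8-bit CPU has to do it */
uint32_t add32_in_8bit_pieces(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    unsigned carry = 0;
    for (int byte = 0; byte < 4; byte++) {
        uint8_t abyte = (uint8_t)(a >> (8 * byte));
        uint8_t bbyte = (uint8_t)(b >> (8 * byte));
        unsigned sum = (unsigned)abyte + bbyte + carry;   /* one "trip" */
        result |= (uint32_t)(sum & 0xFF) << (8 * byte);
        carry = sum >> 8;                                 /* carry into the next trip */
    }
    return result;
}
int main(void)
{
    printf("0x%08x\n", (unsigned)add32_in_8bit_pieces(0x12345678u, 0x0FEDCBA9u)); /* 0x22222221 */
    return 0;
}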
I just saw someone here using floating point on a Microchip PIC, and you see that all too often (float on a microcontroller); if the compiler has a library for it, folks will try it, but they very quickly run out of flash and/or RAM and their performance is dreadful.
So in some respects it is about knowing that variables are not the same sizes you are used to on Windows or Linux, that things take longer, that you have a lot fewer resources, and that debugging is or can be quite a bit different. It's like getting out of your VW Bug, which is the only car you have ever driven, and getting into a pickup truck or moving van, and clipping the curbs or parked cars every time you turn, not being able to stop at the lights, etc. Eventually you get used to it. Unlike the moving truck, you usually can't hurt the bits in the processor when you screw up; you can let the smoke out with software and brick a system, sure, but hopefully more often the program doesn't build or doesn't run or runs really slowly, and you don't have to go to jail each time for running over a pedestrian standing on the corner.
In this day and age (and actually for a long time, but very trivially now) you can find all kinds of simulators/emulators that you can try to get a feel for these platforms. The performance side of a simulator may not be accurate, but if you only have 1024 bytes of RAM and a few Kbytes of flash, you are going to feel that right away, and wonder why only a few lines of high level code consume so many instructions. Understand things like: maybe on that 8-bit MCU I should use mostly 8-bit variables so I don't burn so much code (unless the compiler figures it out; still wasteful), likewise on the 16-bit use mostly 16-bit and on the 32-bit mostly 32-bit. (Trying to "conserve" memory on a larger platform can actually cost you more cycles, having to sign extend or clip every operation if it doesn't have native instructions for that, or adding extra instructions to do that sign extension or clipping.)
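One portable way to say "use whatever size is natural on this target" in C is the <stdint.h> fast types; a small illustrative sketch:
#include <stdint.h>
/* uint_fast8_t means "at least 8 bits, whatever is fastest here":
   typically 8 bits on an 8-bit MCU, often 32 bits on a 32-bit CPU,
   so the same source avoids both wasted code and wasted cycles */
uint32_t sum_bytes(const uint8_t *data, uint_fast8_t count)
{
    uint32_t total = 0;
    for (uint_fast8_t i = 0; i < count; i++)
        total += data[i];
    return total;
}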

Best way to shuffle 64-bit portions of two __m128i's

I have two __m128is, a and b, that I want to shuffle so that the upper 64 bits of a fall in the lower 64 bits of dst and the lower 64 bits of b fall in the upper 64 of dst. i.e.
dst[ 0:63] = a[64:127]
dst[64:127] = b[0:63]
Equivalent to:
__m128i dst = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
or
__m128i dst = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));
Is there a better way to do this than the first method? The second one is just one instruction, but the switch to the floating-point SIMD domain is more costly than the extra instruction of the first.
Latency isn't always the worst thing ever. If it's not part of a loop-carried dep-chain, then just use the single instruction.
Also, there might not be any! Agner Fog's microarch doc says he found no extra latency in some cases when using the "wrong" type of shuffle or boolean, on Sandybridge. Blends still have the extra latency. On Haswell, he says there are no extra delays at all for mixing types of shuffle. (pg 140, Data Bypass Delays.)
So go ahead and use shufps, unless you care a lot about your code being fast on Nehalem. (Previous designs (merom/conroe, and Penryn) didn't have extra bypass delays for using the wrong move or shuffle.)
For AMD, shufps runs in the ivec domain, same as integer shuffles, so it's fine to use it. Like Intel, FP blends run in the FP domain, and thus have no bypass delay for FP data.
If you include multiple asm versions depending on which instruction sets are supported, without going completely nuts about having the optimal version of everything for every CPU like x264 does, you might use wrong-type ops in your version for AVX CPUs, but use multiple instructions in your non-AVX version. Nehalem has large penalties (2 cycle bypass delays for each domain transition), while Sandybridge is 0 or 1 cycle. SnB is the first generation with AVX.
Pre-Nehalem (no SSE4.2) is so old that it's probably not worth tuning a version specifically for it, even though it doesn't have any penalties for "wrong type" shuffles. Nehalem is right on the cusp of being kinda slow, so software running on those systems will have the hardest time operating in real-time, or not feeling slow. Thus, being bad on Nehalem would add to a bad user experience since their system is already not the fastest.
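For what it's worth, here is a minimal self-contained test of the two variants from the question (assumes an SSE2-capable x86 target; the constants are arbitrary):
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>
#include <string.h>
int main(void)
{
    __m128i a = _mm_set_epi64x(0x1111111111111111LL, 0x2222222222222222LL); /* high, low */
    __m128i b = _mm_set_epi64x(0x3333333333333333LL, 0x4444444444444444LL);
    /* variant 1: shift a right by 8 bytes, then interleave low qwords with b */
    __m128i v1 = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
    /* variant 2: one shuffle in the FP domain, with casts around it */
    __m128i v2 = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));
    unsigned long long out1[2], out2[2];
    _mm_storeu_si128((__m128i *)out1, v1);
    _mm_storeu_si128((__m128i *)out2, v2);
    /* both should give dst[0:63] = a[64:127] and dst[64:127] = b[0:63] */
    printf("v1 = %016llx %016llx\n", out1[1], out1[0]);
    printf("v2 = %016llx %016llx\n", out2[1], out2[0]);
    printf("%s\n", memcmp(out1, out2, sizeof out1) == 0 ? "match" : "mismatch");
    return 0;
}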

Why are 32-bit registers divided into 4 parts?

I'm learning about registers. It looks like 32-bit registers are divided up so that they can be accessed as 8-bit registers. This looks very inefficient. Performance would be improved if they didn't do this. So why do they do it?
Also, it costs extra money to design them like this. Why not make the CPU cheaper by not doing it?
Because if you're only dealing with 8-bit values, it'd be inefficient to have to issue all the bitmasks to limit those 32/64-bit registers to just the 8 bits you're working on.
So, x86 registers have
AH/AL = high/low 8 bits of the 16-bit register
AX = whole 16-bit register
EAX = whole 32-bit register
It's far more efficient, in terms of instruction size, to have
mov ah, 0xXX (2 bytes)
rather than forcing
mov ax, 0x00XX (3 bytes in 16-bit code)
mov eax, 0x000000XX (5 bytes in 32-bit code)
As for "designing the cpu to make it cheaper" - it's for backwards compatibility. All modern x86 processors are actually internally a RISC design, with a major chunk of silicon dedicated to taking the x86 instructions coming in and converting them into the CPU's own internal micro-ops (which is basically a RISC instruction set).
The Intel 8080, which was the first "mainstream" microprocessor, had seven main 8-bit registers (A, B, C, D, E, H, and L). Because memory addresses were 16 bits, instructions that needed to use a non-constant memory operand would use a pair of registers (most commonly H and L, but sometimes B and C, or D and E) to form the address. Because the registers in the aforementioned pairs were often used together to represent 16-bit values, there were a few instructions which could operate upon the register pairs as 16-bit quantities. An instruction to add BC to HL would perform the addition by adding C to L, and then by adding B to H (plus a carry if needed). I'm not familiar enough with the 4004 or 8008 (the two predecessors of the 8080) to know if either of them did anything similar in its architecture.
When Intel produced the 8088, they included a full 16-bit arithmetic unit, but they wanted code which was written for the 8080 to be easily convertible to their new architecture. On the 8080, a lot of code had been written to "manually" form addresses out of the 8-bit parts, since doing so was often much faster than using the 16-bit instructions to do the math. For example, if one needed to access some specified table of 256 entries with an index stored in A, one could have done something like (Zilog notation shown, but the 8080 had the same instructions):
ld hl,(baseOfTable) ; 16-bit address
ld c,a
ld b,#0
add hl,bc
ld a,(hl)
but if one could make certain the table was aligned on a 256-byte boundary, one could simplify the code considerably:
ld l,a
ld a,(tableBaseMSB) ; Just load the MSB--assume the LSB is zero
ld h,a
ld a,(hl)
With the 8088 instruction set, it wouldn't terribly often be useful for code written "from scratch" to access the upper and lower parts of registers separately, but there was a lot of code written for the 8080 which used such techniques, and Intel wanted to make it easy for people to convert such code for use on the 8088. Allowing registers to be built from 8-bit pieces was helpful in that regard.
Incidentally, there was another advantage to Intel's architecture: since it included four 16-bit only registers and four registers which could be used as either one 16-bit or two 8-bit parts, that made it possible for code to hold 12 values in registers if eight of them were 255 or less, or eleven values if six of them were 256 or less, etc. When using architectures with more registers, eking out an extra register here and there isn't quite so important, but on the 8088 it was often very helpful.
The ability to address portions of the registers has no effect on their performance when used as 32-bit registers. In that case, this capability just isn't used.
CPUs, regardless of their native bit size, need to manipulate 8-bit values very, very often. Strings of text, for example, are frequently manipulated as consecutive 8-bit values. International character sets are often manipulated as sets of consecutive 16-bit values. So being able to operate rapidly on 8-bit and 16-bit values is of tremendous importance.
If you're asking as a practical matter for x86 CPUs, it's too late. The very first PC CPUs didn't even have 32-bit registers, and compatibility has been retained all the way through.
Backwards compatibility. Processor manufacturers did not want to break compatibility with old software. This is the main reason why x86_64 processors still support 16-bit software (virtual 8086 mode). If you look closely you'll see that the majority of the features in the x86 architecture are shaped by compatibility concerns. I'm not hating.

Moving data from memory to memory in microcontrollers

Why can't we move data directly from one memory location into another memory location?
Pardon me if I am asking a dumb question, but I think this is really the case, at least for the processors I've encountered (8085, 8086 and 80386).
I am not really looking for a way to move the data (like, for example, using MOVS and such), but for the actual reason for this restriction.
What about MOVS? It moves an 8/16/32-bit value addressed by ESI to the location addressed by EDI.
The basic reason is that most instruction sets allow one register operand and one memory operand, and sticking to this format makes designing the instruction decoder easier. It also makes the execution engine inside the CPU easier, because the instruction can typically issue a memory operation to just one memory location, and at most one register block read or write.
To do a memory-to-memory instruction directly requires two memory locations to be designated. This is awkward given a register/memory instruction format. Given the performance of the machines, there is little justification for modifying the instruction format just for this.
A hack used by more modern CPUs is to provide some type of block-move instruction, in which the source and destination locations are held in registers (for the x86 these are ESI and EDI respectively). Then an instruction can just designate two registers (or, in the case of the x86, instructions that simply know which registers to use). That solves the instruction decoding problem.
The instruction execution problem is a little harder, but people have lots of transistors. Organizing an indirect read through one register, an indirect write through another, and incrementing both is awkward in silicon, but that just chews up some transistors.
Now you can have an instruction that moves from memory to memory, just as you asked.
One of the other posters noted that for the x86 there are instructions (MOVSB, MOVSW, MOVSD, ...) that do exactly this, one memory byte/word/... at a time.
Moving a block of memory would be ideal because the CPU can generate high-bandwidth reads and writes. The x86 does this with a REP (repeat) prefix on MOVS to move a larger block.
But if a single instruction can do this, you have the problem that it might take a long time to execute (how long to move 1 GB? --> millions of clock cycles!) and that ruins the interrupt response time of the CPU.
The x86 solves this by allowing REP MOVS to be interrupted, with the PC being set back to the beginning of the instruction. By updating the registers appropriately during the move, you can interrupt and restart the REP MOVS instruction, getting both a fast block move and high interrupt response rates. More transistors down the tube.
The RISC guys figured out that all this complexity for a block move instruction was mostly not worth it. You can code a dumb loop (even the x86):
copy: MOV EAX,[ESI]
ADD ESI,4
MOV [EDI],EAX
ADD EDI,4
DEC ECX
JNE copy
which does the same basic thing as REP MOVS. Pretty much all modern CPUs (x86 and others) execute this so fast (superscalar, etc.) that the bus is just as utilized as with the custom move instruction, but now you don't need all those wasted transistors (or the corresponding heat).
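The same loop in C makes the point visible: every element makes a stopover in a temporary (a register once compiled); it never goes memory-to-memory in one step. A minimal sketch:
#include <stddef.h>
/* copy n 32-bit words; 'tmp' is the register-sized stopover
   the hardware needs between the load and the store */
void copy_words(unsigned int *dst, const unsigned int *src, size_t n)
{
    while (n--) {
        unsigned int tmp = *src++;   /* load:  memory -> register */
        *dst++ = tmp;                /* store: register -> memory */
    }
}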
Most CPU varieties don't allow memory-to-memory moves. Normally the CPU can access only one memory location at a time, which means you need a temporary spot to store the value when moving it (a general purpose register, usually). If you think about it, moving directly from one memory location to another would require that the CPU be able to access two different spots in RAM simultaneously - that means two full memory controllers at least, and even then, the chances they'd "play nice" enough to access the same RAM would be pretty bad. The chip designers might have been able to pull some tricks to allow direct copies from one RAM chip to another, but that would be a pretty special-application kind of feature that would just add cost and complexity to solve a very uncommon problem.
You might be able to use some special DMA hardware to make it look to your program like memory is being moved without that temporary storage, at least from the perspective of your CPU.
You have one set of address lines, one set of data lines, and a few control lines between the CPU and RAM. You can't physically move directly from memory to memory without a second set of address lines and a whole bunch of complicated logic inside the RAM. Therefore, we have to store it temporarily in a register.
You could make an instruction that does the load and store together and looks like one instruction to the programmer, but there are other considerations like instruction size, non-duplication of effective address calculation logic, pipelining, etc. that make it desirable to keep it more simple.
Memory-memory machines turn out to be slower in general than load-store machines. This was deduced/figured out/invented by the RISC researchers around 1980 or so. So the older architectures (VAX, IBM 360) tend to have memory-memory architectures; newer machines do load-store.
Another interesting variant is stack machines; they seem to always be around as a minority.

Resources